SINGLE POLYPEPTIDE CHAIN BINDING MOLECULES



BACKGROUND OF THE INVENTION

This application is a continuation-in-part of
Application Serial No. 902,971, filed September 2,
1986, the contents of which are herein fully
incorporated by reference.

Field of the Invention 

The present invention relates to single polypep- 
tide chain binding molecules having the three dimen- 
sional folding, and thus the binding ability and spe- 
cificity, of the variable region of an antibody. 
Methods of producing these molecules by genetic engin- 
eering are also disclosed. 

Description of the Background Art 

The advent of modern molecular biology and immuno- 
logy has brought about the possibility of producing 
large quantities of biologically active materials in 
highly reproducible form and with low cost. Briefly,
the gene sequence coding for a desired natural protein
is isolated, replicated (cloned) and introduced into a
foreign host such as a bacterium, a yeast (or other
fungi) or a mammalian cell line in culture, with
appropriate regulatory control signals. When the
signals are activated, the gene is transcribed and
translated, and expresses the desired protein. In this
manner, such useful biologically active materials as 
hormones, enzymes or antibodies have been cloned and 
expressed in foreign hosts. 

One of the problems with this approach is that it
is limited by the "one gene, one polypeptide chain"
principle of molecular biology. In other words, a
genetic sequence codes for a single polypeptide chain. 
Many biologically active polypeptides, however, are 
aggregates of two or more chains. For example, anti- 
bodies are three-dimensional aggregates of two heavy 
and two light chains. In the same manner, large
enzymes such as aspartate transcarbamylase, for example,
are aggregates of six catalytic and six regulatory 
chains, these chains being different. In order to 
produce such complex materials by recombinant DNA 
technology in foreign hosts, it becomes necessary to 
clone and express a gene coding for each one of the 
different kinds of polypeptide chains. These genes 
can be expressed in separate hosts. The resulting 
polypeptide chains from each host would then have to 
be reaggregated and allowed to refold together in so- 
lution. Alternatively, the two or more genes coding 
for the two or more polypeptide chains of the
aggregate could be expressed in the same host
simultaneously, so that refolding and reassociation
into the native structure with biological activity
will occur
after expression. The approach, however, necessitates 
expression of multiple genes, and as indicated, in 
some cases, in multiple and different hosts. These 
approaches have proved to be inefficient. 

Even if the two or more genes are expressed in the 
same organism it is quite difficult to get them all 
expressed in the required amounts. 

A classical example of multigene expression to
form multimeric polypeptides is the expression by
recombinant DNA technology of antibodies. Genes for
heavy and light chains have been introduced into
appropriate hosts and expressed, followed by
reaggregation of these individual chains into
functional antibody molecules (see, for example,
Munro, Nature, 312:597 (1984); Morrison, S.L.,
Science, 229:1202 (1985); Oi et al., BioTechniques,
4:214 (1986); Wood et al., Nature, 314:446-449
(1985)).

Antibody molecules have two generally recognised
regions in each of the heavy and light chains. These
regions are the so-called "variable" region which is
responsible for binding to the specific antigen in 
question, and the so-called "constant" region which is 
responsible for biological effector responses such as 
complement binding, etc. The constant regions are not 
necessary for antigen binding. The constant regions 
have been separated from the antibody molecule, and 
biologically active (i.e. binding) variable regions 
have been obtained. 

The variable regions of an antibody are composed 
of a light chain and a heavy chain. Light and heavy
chain variable regions have been cloned and expressed
in foreign hosts, and maintain their binding ability
(Moore et al., European Patent Publication 0088994
(published September 21, 1983)). 

Further, it is by now well established that all 
antibodies of a certain class and their Fab fragments
whose structures have been determined by X-ray crys- 
tallography, even when from different species, show 
closely similar variable regions despite large differ- 
ences in the hypervariable segments. The immunoglo- 
bulin variable region seems to be tolerant toward 
mutations in the combining loops. Therefore, other 
than in the hypervariable regions, most of the
so-called "variable" regions of antibodies, which are
defined by both heavy and light chains, are in fact
quite constant in their three dimensional arrangement.
See, for example, Huber, R., "Structural Basis for
Antigen-Antibody Recognition," Science, 233:702-703
(1986).

It would be very efficient if one could produce 
single polypeptide chain molecules which have the same
biological activity as the multiple chain aggregates
such as, for example, multiple chain antibody
aggregates or enzyme aggregates. Given the "one
gene-one polypeptide chain" principle, such single
chain molecules would be more readily producible, and
would not
necessitate multiple hosts or multiple genes in the 
cloning and expression. In order to accomplish this, 
it is first necessary to devise a method for generat- 
ing single chain structures from two-chain aggregate 
structures, wherein the single chain will retain the 
three-dimensional folding of the separate natural ag- 
gregate of two polypeptide chains. 

While the art has discussed the study of proteins 
in three dimensions, and has suggested modifying their 
architecture (see, for example, the article "Protein
Architecture: Designing from the Ground Up," by Van
Brunt, J., BioTechnology, 4:277-283 (April, 1986)),
the problem of generating single chain structures from 
multiple chain structures, wherein the single chain 
structure will retain the three-dimensional architec- 
ture of the multiple chain aggregate, has not been 
satisfactorily addressed. 

Given that methods for the preparation of genetic
sequences, their replication, their linking to
expression control regions, formation of vectors
therewith and transformation of appropriate hosts are
well understood techniques, it would indeed be greatly
advantageous to be able to produce, by genetic
engineering, single polypeptide chain binding proteins
having the characteristics and binding ability of
multi-chain variable regions of antibody molecules.

SUMMARY OF THE INVENTION

The present invention starts with a computer based 
system and method to determine chemical structures for 
converting two naturally aggregated but chemically 
separated light and heavy polypeptide chains from an 
antibody variable region into a single polypeptide 
chain which will fold into a three dimensional
structure very similar to the original structure made
of the two polypeptide chains.

The single polypeptide chain obtained from this 
method can then be used to prepare a genetic sequence 
coding therefor. The genetic sequence can then be
replicated in appropriate hosts, further linked to 
control regions, and transformed into expression 
hosts, wherein it can be expressed. The resulting 
single polypeptide chain binding protein, upon refold- 
ing, has the binding characteristics of the aggregate 
of the original two (heavy and light) polypeptide 
chains of the variable region of the antibody. 

The invention therefore comprises:

A single polypeptide chain binding molecule which 
has binding specificity substantially similar to the 
binding specificity of the light and heavy chain ag- 
gregate variable region of an antibody. 



WO 88/01649 



PCT/US87/02208 




The invention also comprises genetic sequences 
coding for the above mentioned single polypeptide 
chain, cloning and expression vectors containing such 
genetic sequences, hosts transformed with such vec- 
tors, and methods of production of such polypeptides 
by expression of the underlying genetic sequences in 
such hosts.

The invention also extends to uses for the binding
proteins, including uses in diagnostics, therapy, in 
vivo and in vitro imaging, purifications, and biosen- 
sors. The invention also extends to the single chain 
binding molecules in immobilized form, or in detect- 
ably labelled forms for utilization in the above men- 
tioned diagnostic, imaging, purification or biosensor 
applications. It also extends to conjugates of the
single polypeptide chain binding molecules with
therapeutic agents such as drugs or specific toxins, for
delivery to a specific site in an animal, such as a 
human patient. 

Essentially all of the uses that the prior art has 
envisioned for monoclonal or polyclonal antibodies, or 
for variable region fragments thereof, can be con- 
sidered for the molecules of the present invention. 

The advantages of single chain over conventional 
antibodies are smaller size, greater stability and 
significantly reduced cost. The smaller size of sin- 
gle chain antibodies may reduce the body's immunologic 
reaction and thus increase the safety and efficacy of 
therapeutic applications. Conversely, the single 
chain antibodies could be engineered to be highly
antigenic. The increased stability and lower cost
permit greater use in biosensors and protein
purification systems. Because it is a smaller and
simpler protein, the single chain antibody is easier
to further modify by protein engineering so as to
improve both its binding affinity and its specificity.
Improved affinity will increase the sensitivity of
diagnosis and detection systems, while improved
specificity will reduce the number of false positives
observed.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention as defined in the claims can 
be better understood with reference to the text and to 
the following drawings, as follows: 

Figure 1 is a block diagram of the hardware as- 
pects of the serial processor mode of the present in- 
vention. 

Figure 2 is a block diagram of an alternate
embodiment of the hardware aspects of the present
invention.

Figure 3 is a block diagram of the three general 
steps of the present invention. 

Figure 4 is a block diagram of the steps in the 
site selection step in the single linker embodiment. 

Figure 5A is a schematic two dimensional simplified
representation of the light chain L and heavy chain
H of two naturally aggregated antibody variable region
Fv polypeptide chains used to illustrate the site
selection process.

Figure 5B is a two dimensional representation of
the three dimensional relationship of the two
aggregated polypeptide chains showing the light chain
L ( ) and the heavy chain H (-) of the variable
region of one antibody.

Figure 6A is a simplified two dimensional schematic
diagram of the two polypeptide chains showing the
location of the residue Tau 1 and the residue
Sigma 1.

Figure 6B is a two dimensional representation of 
the actual relationship of the two polypeptide chains 
showing the residue Tau 1 and the residue Sigma 1. 

Figure 7 shows, in a very simplified schematic way,
the concept of the direction linkers that are possible 
between the various possible sites on the light chain 
L and the heavy chain H in the residue Tau 1 and resi- 
due Sigma 1 respectively. 

Figure 8A is a two dimensional simplified schematic
diagram of a single chain antibody linking together
two separate chains (Heavy and Light) by
linker 1 ( ) to produce a single chain antibody.

Figure 8B is a two dimensional representation 
showing a single chain antibody produced by linking 
two aggregated polypeptide chains using linker 1.

Figure 9 shows a block diagram of candidate selec- 
tion for correct span. 

Figure 10 shows a block diagram of candidate sel- 
ection for correct direction from N terminal to C ter- 
minal. 

Figure 11 shows a comparison of direction of a gap 
to direction of a candidate. 

Figure 12 shows a block diagram of candidate sel- 
ection for correct orientation at both ends. 

Figure 13 shows a block diagram of selection of 
sites for the two-linker embodiment. 

Figure 14 shows examples of rules by which candi- 
dates may be ranked. 

Figure 15A shows a two-dimensional simplified
representation of the variable domain of an Fv light
chain, L, and the variable domain of an Fv heavy 
chain, H, showing the first two sites to be linked. 

Figure 15B shows a two-dimensional representation 
of the three-dimensional relationships between the 
variable domain of an Fv light chain, L, and the
variable domain of an Fv heavy chain, H, showing the
regions in which the second sites to be linked can be
found and the linker between the first pair of sites. 

Figure 16A shows the two-dimensional simplified
representation of the variable domain of an Fv light 
chain, L, and the variable domain of an Fv heavy 
chain, H, showing the regions in which the second 
sites to be linked can be found and the linker between 
the first pair of sites. 

Figure 16B shows the two-dimensional representation
of the three-dimensional relationships between
the variable domain of an Fv light chain, L, and the
variable domain of an Fv heavy chain, H, showing the 
regions in which the second sites to be linked can be 
found and the linker between the first pair of sites. 

Figure 17A shows the two-dimensional simplified 
representation of the variable domain of an Fv light 
chain, L, and the variable domain of an Fv heavy 
chain, H, showing the second linker and the portions 
of the native protein which are lost. 

Figure 17B shows the two-dimensional representa- 
tion of the three-dimensional relationships between 
the variable domain of an Fv light chain, L, and the 
variable domain of an Fv heavy chain, H, showing the 
second linker and the portions of native protein which 
are lost. 






Figure 18 shows the two-dimensional, simplified
representation of the variable domain of an Fv light
chain, L, and the variable domain of an Fv heavy
chain, H, showing the complete construction.

Figure 19 shows a block diagram of the parallel 
processing mode of the present invention. 

Figure 20A shows five pieces of molecular
structure. The uppermost segment consists of two
peptides joined by a long line. The separation between
the peptides is 12.7 Å. The first Cα of each peptide
lies on the X-axis. The two dots indicate the
standard reference point in each peptide.

Below the gap are four linker candidates (labeled 
1, 2, 3 & 4), represented by a line joining the alpha
carbons. In all cases, the first and penultimate
alpha carbons are on lines parallel to the X-axis,
spaced 8.0 Å apart. Note that the space between dots
in linker 1 is much shorter than in the gap. 

Figure 20B shows the initial peptides of linkers
2, 3, and 4 which have been aligned with the first 
peptide of the gap. For clarity, the linkers have 
been translated vertically to their original posi- 
tions. 

The vector from the first peptide in the gap to 
the second peptide in the gap lies along the X-axis; a
corresponding vector for linkers 3 and 4 also lies
along the X-axis. Linker 2, however, has this vector
pointing up and to the right; thus linker 2 is
rejected.

Figure 20C shows the ten atoms which compose the 
initial and final peptides of linkers 3 and 4, which 
have been least-squares fit to the corresponding atoms 
from the gap. These peptides have been drawn in. 
Note that in the gap and in linker 4 the final peptide
points down and lies more-or-less in the plane of the 
paper. In linker 3, however, this final pep- 
tide points down and to the left and is twisted about 
90 degrees so that the carbonyl oxygen points toward 
the viewer. Thus linker 3 is rejected. 

Sections B and C are stereo diagrams which may be 
viewed with the standard stereo viewer provided.

Figure 21 shows the nucleotide sequence and
translation of the sequence for the heavy chain of a
mouse anti-bovine growth hormone (BGH) monoclonal
antibody.

Figure 22 shows the nucleotide sequence and trans- 
lation of the sequence for the light chain of the same 
monoclonal antibody as that shown in Figure 21.

Figure 23 is a plasmid restriction map containing
the variable heavy chain sequence (pGX3772) and
that containing the variable light sequence (pGX3773)
shown in Figures 21 and 22.

Figure 24 shows construction TRY40 comprising the
nucleotide sequence and its translation sequence of a 
single polypeptide chain binding protein prepared ac- 
cording to the methods of the invention. 

Figure 25 shows a restriction map of the expression
vector pGX3776 carrying a single chain binding
protein, the sequence of which is shown in Figure 24.
In this and subsequent plasmid maps (Figures 27 and
29) the hashed bar represents the promoter OL/PR
sequence and the solid bar represents heavy chain
variable region sequences.

Figure 26 shows the sequences of TRY61, another 
single chain binding protein of the invention. 

Figure 27 shows expression plasmid pGX4904 carrying
the genetic sequence shown in Figure 26.

Figure 28 shows the sequences of TRY59, another 
single chain binding protein of the invention. 

Figure 29 shows the expression plasmid pGX4908
carrying the genetic sequence shown in Figure 28. 

Figures 30A, 30B, 30C, and 30D (stereo) are
explained in detail in Example 1. They show the design
and construction of double linked single chain
antibody TRY40.

Figures 31A and 31B (stereo) are explained in
detail in Example 2. They show the design and
construction of single linked single chain antibody
TRY61.

Figures 32A and 32B (stereo) are explained in de- 
tail in Example 3. They show the design and construc- 
tion of single linked single chain antibody TRY59. 

Figure 33 is explained in Example 4 and shows the 
sequence of TRY104b.

Figure 34 shows a restriction map of the expression
vector pGX4910 carrying a single linker construction,
the sequence of which is shown in Figure 33.

Figure 35 shows the assay results for BGH binding 
activity wherein strip one represents TRY61 and strip 
two represents TRY40. 

Figure 36 is explained in Example 4 and shows the 
results of competing the Fab portion of 3C2 monoclonal
with TRY59 protein.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
TABLE OF CONTENTS



I. General Overview

II. Hardware and Software Environment

III. Single Linker Embodiment

A. Plausible Site Selection

B. Selection of Candidates

1. Selecting Candidates with Proper
Distance Between the N Terminal and
the C Terminal.

2. Selecting Candidates with Proper
Direction From the N Terminal to
the C Terminal.

3. Selecting Candidates With Proper
Orientation Between the Termini.

C. Ranking and Eliminating Candidates

IV. Double and Multiple Linker Embodiments

A. Plausible Site Selection

B. Candidate Selection and Candidate
Rejection Steps

V. Parallel Processing Embodiment 

VI. Preparation and Expression of Genetic 
Sequences and Uses 

I. General Overview 

The present invention starts with a computer based 
system and method for determining and displaying pos- 
sible chemical structures (linkers) for converting two 
naturally aggregated but chemically separate heavy and 
light (H and L) polypeptide chains from the variable 
region of a given antibody into a single polypeptide 
chain which will fold into a three dimensional
structure very similar to the original structure made of
two polypeptide chains. The original structure is 
referred to hereafter as "native protein." 

The first general step of the three general design 
steps of the present invention involves selection of 
plausible sites to be linked. In the case of a single 
linker, criteria are utilized to select a plausible 
site on each of the two polypeptide chains (H and L in
the variable region) which will result in 1) a minimum
loss of residues from the native protein chains and 2)
a linker of minimum number of amino acids consistent
with the need for stability. A pair of sites defines
a gap to be bridged or linked.

A two-or-more-linker approach is adopted when a 
single linker cannot achieve the two stated goals.
In both the single-linker case and the two-or-more-
linker case, more than one gap may be selected for use 
in the second general step. 

The second general step of the present invention
involves examining a data base to determine possible
linkers to fill the plausible gaps selected in the 
first general step, so that candidates can be enrolled 
for the third general step. Specifically, a data base 
contains a large number of amino acid sequences for 
which the three-dimensional structure is known. In 
the second general step, this data base is examined to 
find which amino acid sequences can bridge the gap or 
gaps to create a plausible one-polypeptide structure 
which retains most of the three dimensional features 
of the native (i.e., original aggregate) variable
region molecule. The testing of each possible linker
proceeds in three general substeps. The first general
substep utilizes the length of the possible candidate.
Specifically, the span or length (a scalar quantity)
of the candidate is compared to the span of each of 
the gaps. If the difference between the length of the 
candidate and the span of any one of the gaps is less 
than a selected quantity, then the present invention 
proceeds to the second general substep with respect to 
this candidate. Figure 20A shows one gap and four 
possible linkers. The first linker fails the first 
general substep because its span is quite different 
from the span of the gap. 
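
The span test of the first general substep can be
sketched as follows. This is only a minimal
illustration: the function names, the coordinates of
the two example linkers and the 1.0 Å tolerance are
assumptions made for the example, and only the
12.7 Å gap separation is taken from the description
of Figure 20A.

```python
import math

def span(first_point, last_point):
    """Scalar distance between the reference points at the two ends of a peptide."""
    return math.dist(first_point, last_point)

def passes_span_check(candidate_span, gap_spans, tolerance):
    """True if the candidate's span is within `tolerance` of the span of any gap."""
    return any(abs(candidate_span - g) < tolerance for g in gap_spans)

# One gap of 12.7 angstroms, as quoted for Figure 20A.
gap_spans = [12.7]

# Hypothetical end-point coordinates (angstroms) for two candidate linkers.
short_linker = span((0.0, 0.0, 0.0), (8.0, 0.0, 0.0))    # much shorter than the gap
good_linker = span((0.0, 0.0, 0.0), (12.0, 3.0, 2.0))    # close to the gap span

TOLERANCE = 1.0  # assumed "selected quantity"; the text does not give a value

short_ok = passes_span_check(short_linker, gap_spans, TOLERANCE)
good_ok = passes_span_check(good_linker, gap_spans, TOLERANCE)
```

Because only a scalar is compared, this test is cheap
and can discard most of the data base before any
rigid-body fitting is attempted.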

In the second general substep, called the direc- 
tion substep, the initial peptide of the candidate is 
aligned with the initial peptide of each gap. Speci- 
fically, a selected number of atoms in the initial 
peptide of the candidate are rotated and translated as 
a rigid body to best fit the corresponding atoms in 
the initial peptide of each gap. The three
dimensional vector (called the direction of the
linker) from the initial peptide of the candidate
linker to the final peptide of the candidate linker is
compared to the three dimensional vector (called the
direction of
the gap) from the initial peptide of each gap to the 
final peptide of the same gap. If the ends of these 
two vectors come within a preselected distance of each 
other, the present invention proceeds to the third 
general substep of the second general step with re- 
spect to this candidate linker. 
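
The direction test can be sketched in the same way.
The sketch below makes a simplifying assumption that
is not in the text: instead of a best fit over a
selected number of atoms, it aligns the initial
peptides by building a local coordinate frame from
three atoms of each initial peptide and expressing
the direction vector in that frame. All names,
coordinates and the 1.0 Å cutoff are hypothetical.

```python
import math

def sub(p, q):
    return tuple(a - b for a, b in zip(p, q))

def dot(p, q):
    return sum(a * b for a, b in zip(p, q))

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

def unit(p):
    n = math.sqrt(dot(p, p))
    return tuple(a / n for a in p)

def local_frame(a, b, c):
    """Right-handed orthonormal frame fixed to three non-collinear atoms."""
    e1 = unit(sub(b, a))
    w = sub(c, a)
    e2 = unit(sub(w, tuple(dot(w, e1) * x for x in e1)))
    return e1, e2, cross(e1, e2)

def direction_in_frame(initial_atoms, final_point):
    """Vector from the initial peptide to the final peptide, in the local frame."""
    a, b, c = initial_atoms
    v = sub(final_point, a)
    return tuple(dot(v, e) for e in local_frame(a, b, c))

def passes_direction_check(cand_initial, cand_final, gap_initial, gap_final, max_dist):
    """True if the ends of the two aligned direction vectors are within max_dist."""
    d_cand = direction_in_frame(cand_initial, cand_final)
    d_gap = direction_in_frame(gap_initial, gap_final)
    return math.dist(d_cand, d_gap) < max_dist

# Hypothetical gap: initial peptide near the origin, final peptide along X.
gap_initial = ((0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0))
gap_final = (12.7, 0.0, 0.0)

# Candidate A: nearly the same shape, but stored rotated and translated.
cand_a_initial = ((5.0, 5.0, 5.0), (5.0, 6.5, 5.0), (3.6, 7.0, 5.0))
cand_a_final = (4.5, 17.5, 5.0)

# Candidate B: correct span, but pointing "up and to the right".
cand_b_final = (9.0, 9.0, 0.0)

a_ok = passes_direction_check(cand_a_initial, cand_a_final, gap_initial, gap_final, 1.0)
b_ok = passes_direction_check(gap_initial, cand_b_final, gap_initial, gap_final, 1.0)
```

Candidate B has nearly the same span as the gap and
would pass the first substep, which is why a separate
direction test is needed.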

Figure 20B shows one gap and three linkers. All 
the linkers have the correct span and the initial pep- 
tides have been aligned. The second linker fails the 
second general substep because its direction is quite 
different from that of the gap; the other two linkers 
are carried forward to the third general substep of 
the second general step. 

In the third general substep of the second general
step of the present invention, the orientations
of the terminal peptides of each linker are compared
to the orientations of the terminal peptides of each
gap. Specifically, a selected number of atoms (3, 4,
or 5; 5 in the preferred embodiment) from the initial
peptide of the candidate plus the selected number
of atoms (3, 4, or 5; 5 in the preferred embodiment)
from the final peptide of the candidate are taken as a
rigid body. The corresponding atoms from one of the
gaps (viz. 5 from the initial peptide and 5 from the
final peptide) are taken as a second rigid body.
These two rigid bodies are superimposed by a
least-squares fit. If the error for this fit is below
some preselected value, then the candidate passes the
third general substep of the second general step and
is enrolled for the third general step of the present
invention. If the error is greater than or equal to the
preselected value, the next gap is tested. When all
gaps have been tested without finding a sufficiently
good fit, the candidate is abandoned.
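
The orientation test and its loop over the gaps can
be sketched as below. The text does not say which
least-squares algorithm is used; this sketch swaps in
Horn's quaternion formulation, with the largest
eigenvalue found by Jacobi rotations, as one standard
way to obtain the error of the best rigid-body
superposition. The function names, the three-atom
peptides and the 0.5 Å threshold are assumptions.

```python
import math

def centered(points):
    """Shift a list of (x, y, z) points so that their centroid is at the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - c[k] for k in range(3)] for p in points]

def largest_eigenvalue(matrix, sweeps=50):
    """Largest eigenvalue of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-24:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                if theta > math.pi / 4:        # keep the rotation angle small
                    theta -= math.pi / 2
                elif theta < -math.pi / 4:
                    theta += math.pi / 2
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):             # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):             # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return max(a[i][i] for i in range(n))

def superposition_rmsd(P, Q):
    """RMS error after the best rigid-body (rotation plus translation) fit of P to Q."""
    A, B = centered(P), centered(Q)
    S = [[sum(a[i] * b[j] for a, b in zip(A, B)) for j in range(3)] for i in range(3)]
    (Sxx, Sxy, Sxz), (Syx, Syy, Syz), (Szx, Szy, Szz) = S
    N = [
        [Sxx + Syy + Szz, Syz - Szy, Szx - Sxz, Sxy - Syx],
        [Syz - Szy, Sxx - Syy - Szz, Sxy + Syx, Szx + Sxz],
        [Szx - Sxz, Sxy + Syx, -Sxx + Syy - Szz, Syz + Szy],
        [Sxy - Syx, Szx + Sxz, Syz + Szy, -Sxx - Syy + Szz],
    ]
    total = sum(x * x for p in A + B for x in p)
    residual = total - 2.0 * largest_eigenvalue(N)
    return math.sqrt(max(0.0, residual / len(P)))

def first_fitting_gap(candidate_atoms, gaps, max_error):
    """Index of the first gap the candidate fits, or None if it is abandoned."""
    for index, gap_atoms in enumerate(gaps):
        if superposition_rmsd(candidate_atoms, gap_atoms) < max_error:
            return index
    return None

# Hypothetical atoms: three from the initial peptide, three from the final one.
gap = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (10, 0, 0), (11, 0, 0), (11, 1, 0)]
twisted_gap = gap[:5] + [(11, -1, 0)]          # final peptide flipped about X
# The candidate is the gap geometry stored in a different position and orientation.
candidate = [(-y + 1.0, x + 2.0, z + 3.0) for x, y, z in gap]

chosen = first_fitting_gap(candidate, [twisted_gap, gap], 0.5)
abandoned = first_fitting_gap(candidate, [twisted_gap], 0.5)
```

In this example the candidate is rejected for the
twisted gap, is then tested against the next gap, and
is accepted there; with only the twisted gap
available it is abandoned.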

The third general step of the present invention 
results in the ranking of the linker candidates from 
most plausible to least plausible. The most plausible 
candidate is the fragment that can bridge the two 
plausible sites of one of the gaps to form a single 
polypeptide chain, where the bridge will least distort 
the resulting three dimensional folding of the single 
polypeptide chain from the natural folding of the ag- 
gregate of the two originally chemically separate 
chains. 

In this third general step of the present inven- 
tion, an expert operator uses an interactive computer- 
graphics approach to rank the linker candidates from 
most plausible to least plausible. This ranking is
done by observing the interactions between the linker
candidate and all retained portions of the native
protein. A set of rules is used for the ranking.
These expert system rules can be built into the system
so that the linkers are displayed only after they have
satisfied the expert system rules that are utilized.

The present invention can be programmed so that 
certain expert rules are utilized as a first general 
substep in the third general step to rank candidates 
and even eliminate unsuitable candidates before visual 
inspection by an expert operator, which would be the 
second general substep of the third general step. 
These expert rules assist the expert operator in rank- 
ing the candidates from most plausible to least plaus- 
ible. These expert rules can be modified based on 
experimental data on linkers produced by the system
and methods of the present invention.

The most plausible candidate is a genetically
producible single polypeptide chain binding molecule
which has a very significantly higher probability (by
a factor of a million or more) of folding into a three
dimensional structure very similar to the original
structure made of the heavy and light chains of the
antibody variable region than would a molecule whose
linker was selected at random. In this way, the
computer based system and method of the present
invention can be utilized to engineer single
polypeptide chains by using one or more linkers which
convert naturally aggregated but chemically separated
polypeptide chains into the desired single chain.

The elected candidate offers to the user a linked 
chain structure having a very significantly increased
probability of proper folding than would be obtained
using a random selection process. This means that the 
genetic engineering aspect of creating the desired 
single polypeptide chain is significantly reduced, 
since the number of candidates that have to be gene- 
tically engineered in practice is reduced by a corres- 
ponding amount. The most plausible candidate can be 
used to genetically engineer an actual molecule. 

The parameters of the various candidates can be
stored for later use. They can also be provided to
the user either visually or recorded on suitable
media (paper, magnetic tape, color slides, etc.). The
results of the various steps utilized in the design 
process can also be stored for later use or examina- 
tion. 

The design steps of the present invention operate 
on a conventional minicomputer system having storage 
devices capable of storing the amino acid sequence-
structure data base, the various application programs
utilized and the parameters of the possible linker
candidates that are being evaluated. 

The minicomputer CPU is connected by a suitable 
serial processor structure to an interactive computer- 
graphics display system. Typically, the interactive 
computer-graphics display system comprises a display 
terminal with resident three-dimensional application 
software and associated input and output devices, such 
as X/Y plotters, position control devices (potentio- 
meters, an x-y tablet, or a mouse), and keyboard. 

The interactive computer-graphics display system 
allows the expert operator to view the chemical struc- 
tures being evaluated in the design process of the 
present invention. Graphics and programs are used to
select the gaps (Gen. Step 1), and to rank candidates 
(Gen. Step 3). Essentially, it operates in the same 
fashion for the single linker embodiment and for the 
two or more linker embodiments. 

For example, during the first general step of the 
present invention, the computer-graphics interactive 
display system allows the expert operator to visually
display the two naturally aggregated but chemically 
separate polypeptide chains. Using three dimensional 
software resident in the computer-graphics display 
system, the visual representation of the two separate 
polypeptide chains can be manipulated as desired. For 
example, the portion of the chain(s) being viewed can 
be magnified electronically, and such magnification 
can be performed in a zoom mode. Conversely, the
image can be reduced in size, and this reduction can
also be done in a reverse zoom mode. The position of 
the portion of the molecule can be translated, and the 
displayed molecule can be rotated about any one of the 
three axes (x, y and z). Specific atoms in the chain 
can be selected with an electronic pointer. Selected 
atoms can be labeled with appropriate text. Specific 
portions of native protein or linker can be identified 
with color or text or brightness. Unwanted portions 
of the chain can be erased from the image being dis- 
played so as to provide the expert operator with a 
visual image that represents only a selected aspect of 
the chain(s). Atoms selected by pointing or by name
can be placed at the center of the three dimensional
display; subsequent rotation uses the selected atom as 
the origin. These and other display aspects provide 
the expert operator with the ability to visually
represent portions of the chains, which increases the
ability to perform the structural design process.

One of the modes of the present invention utilizes 
a serial computational architecture. This architec- 
ture using the present equipment requires approximate- 
ly four to six hours of machine and operator time in 
order to go through the various operations required 
for the three general steps for a particular selection
of gaps. Obviously, it would be desirable to signifi- 
cantly reduce the time since a considerable portion 
thereof is the time it takes for the computer system 
to perform the necessary computational steps. 

An alternate embodiment of the present invention 
utilizes a parallel processing architecture. This 
parallel processing architecture significantly reduces 
the time required to perform the necessary computa- 
tional steps. A hypercube of a large number of nodes 
can be utilized so that the various linkers that are 
possible for the selected sites can be rapidly
presented to the expert system operator for evaluation.

Since there are between 200 and 300 known protein 
structures, the parallel processing approach can be 
utilized. There currently are computers commercially 
available that have as many as 1,024 computing nodes. 

Using a parallel processing approach, the data 
base of observed peptide structures can be divided 
into as many parts as there are computing nodes. For 
example, if there are structures for 195 proteins with 
219 amino acids each, one would have structures for 
195x218 dipeptides, 195x217 tripeptides, 195x216 tet- 
rapeptides, etc. One can extract all peptides up to 
some length n. For example, if n were 30, one would
have 195x30x204 peptides. Of course, proteins vary in 
length, but with 100 to 400 proteins of average length 
200 (for example), and for peptide linkers up to 
length 30 amino acids (or any other reasonable num- 
ber), one will have between 1,000,000 and 4,000,000 
peptide structures. Once the peptides have been
extracted and labeled with the protein from which they
came, one is free to divide all the peptides as evenly 
as possible among the available computing nodes. 
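
The counting argument above can be checked with a short sketch. The Python below is illustrative only and is not the FLECS code of the present system; the function names count_peptides and divide_evenly are hypothetical. For 195 proteins of 219 amino acids it reproduces the 195x218 dipeptide count, and for n = 30 it gives an exact total (single residues included) that is close to the rounded 195x30x204 figure.

```python
def count_peptides(protein_lengths, max_len):
    """Count contiguous peptides of length 1..max_len over all proteins.

    A protein of N residues contains N - k + 1 peptides of length k.
    """
    total = 0
    for n_residues in protein_lengths:
        for k in range(1, min(max_len, n_residues) + 1):
            total += n_residues - k + 1
    return total


def divide_evenly(items, n_nodes):
    """Deal items round-robin so that node shares differ by at most one."""
    return [items[i::n_nodes] for i in range(n_nodes)]


lengths = [219] * 195
dipeptides = count_peptides(lengths, 2) - count_peptides(lengths, 1)
print(dipeptides == 195 * 218)
print(count_peptides(lengths, 30))
```

Round-robin dealing is only one way to balance the load; any split that keeps the shares nearly equal would serve the same purpose.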

The parallel processing mode operates as follows. 
The data base of known peptides is divided among the 
available nodes. Each gap is sent to all the nodes. 
Each node takes the gap and tests it against those 
peptides which have been assigned to it and returns 
information about any peptides which fit the gap and 
therefore are candidate linkers. As the testing for 
matches between peptides and gaps proceeds
independently in each node, the searching will go faster by a
factor equal to the number of nodes. 
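
The scatter and gather pattern just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: worker threads stand in for hypercube nodes, each peptide is a dictionary with a hypothetical "span" field (its end-to-end length in Angstroms), and the fit test (span within a tolerance of the gap length) is a placeholder for the real geometric test.

```python
from concurrent.futures import ThreadPoolExecutor


def search_node(peptides, gap_length, tolerance):
    """One node: test the gap against only the peptides assigned to it."""
    return [p for p in peptides if abs(p["span"] - gap_length) <= tolerance]


def parallel_search(shares, gap_length, tolerance=0.5):
    """Send the same gap to every node and gather the candidate linkers."""
    hits = []
    with ThreadPoolExecutor(max_workers=max(1, len(shares))) as pool:
        futures = [
            pool.submit(search_node, share, gap_length, tolerance)
            for share in shares
        ]
        for future in futures:
            hits.extend(future.result())
    return hits
```

Because each node works only on its own share and needs no data from the others, the same function could be run on separate processors without change; threads are used here only to keep the sketch self-contained.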

A first embodiment of the present invention uti- 
lizes a single linker to convert the naturally aggre- 
gated but chemically separate heavy and light chains 
into a single polypeptide chain which will fold into a 
three dimensional structure very similar to the orig- 
inal structure made of two polypeptide chains. 

A second embodiment utilizes two or more linkers 
to convert the two heavy and light chains into the 
desired single polypeptide chain. The steps involved 
in each of these embodiments utilizing the present 
invention are illustrated in the explanation below. 

Once the correct amino acid sequence for a single
chain binding protein has been defined by the computer 
assisted methodology, it is possible, by methods well 
known to those with skill in the art, to prepare an 
underlying genetic sequence coding therefor. 

In preparing this genetic sequence, it is possible 
to utilize synthetic DNA by synthesizing the entire 
sequence de novo. Alternatively, it is possible to
obtain cDNA sequences coding for certain preserved 
portions of the light and heavy chains of the desired 
antibody, and splice them together by means of the 
necessary sequence coding for the peptide linker, as 
described. 

Also by methods known in the art, the resulting 
sequence can be amplified by utilizing well known 
cloning vectors and well known hosts. Furthermore, 
the amplified sequence, after checking for
correctness, can be linked to promoter and terminator
signals, inserted into appropriate expression vectors,
and transformed into hosts such as procaryotic or eu- 
caryotic hosts. Bacteria, yeasts (or other fungi) or 
mammalian cells can be utilized. Upon expression, 
either by itself or as part of fusion polypeptides, as
will otherwise be known to those of skill in the art, 
the single chain binding protein is allowed to refold 
in physiological solution, at appropriate conditions 
of pH, ionic strength, temperature, and redox
potential, and purified by standard separation procedures.
These would include chromatography in its various dif- 
ferent types, known to those with skill in the art. 

The thus obtained purified single chain binding 
protein can be utilized by itself, in detectably
labelled form, in immobilized form, or conjugated to
drugs or other appropriate therapeutic agents, in 
diagnostic, imaging, biosensors, purifications, and 
therapeutic uses and compositions. Essentially all 
uses envisioned for antibodies or for variable region 
fragments thereof can be considered for the molecules 
of the present invention. 

II. Hardware and Software Environment
A block diagram of the hardware aspects of the 
present invention is found in Figure 1. A central pro- 
cessing unit (CPU) 102 is connected to a first bus 
(designated Massbus 104) and to a second bus
(designated Unibus 106). A suitable form for CPU 102 is a
model VAX 11/780 made by Digital Equipment Corporation
of Maynard, Massachusetts. Any suitable type of CPU, 
however, can be used. 

Bus 104 connects CPU 102 to a plurality of storage 
devices. In the best mode, these storage devices
include a tape drive unit 106. The tape drive unit 106
can be used, for example, to load into the system the 
data base of the amino acid sequences whose three 
dimensional structures are known. A suitable form for 
tape drive 106 is a Digital Equipment Corporation
model TU 78 drive, which operates at 125 inches per
second, and has a 1600-6250 bit per inch (BPI) dual
capability. Any suitable type of tape drive can be used,
however.

Another storage device is a pair of hard disk 
units labeled generally by reference numeral 108. A 
suitable form for disk drive 108 comprises two Digital 
Equipment Corporation RM05 disk drives having, for
example, 256 Mbytes of storage per disk. Another disk
drive system is also provided in the serial processor 
mode and is labeled by reference numeral 110. This 
disk drive system is also connected to CPU 102 by bus 
104. A suitable form for the disk system 110
comprises three Digital Equipment Corporation model RA 81
hard disk drives having, for example, 450 Mbytes of
storage per disk.

Dynamic random access memory is also provided by a 
memory stage 112 also connected to CPU 102 by bus 104. 
Any suitable type of dynamic memory storage device can 
be used. In the serial processor mode, the memory is 
made up of a plurality of semiconductor storage
devices found in a DEC model ECC memory unit. Any
suitable type of dynamic memory can be employed.

The disk drives 108 and 110 store several differ- 
ent blocks of information. For example, they store 
the data base containing the amino acid sequences and 
structures that are read in by the tape drive 106.
They also store the application software package re- 
quired to search the data base in accordance with the 
procedures of the present invention. They also store 
the documentation and executables of the software. 
The hypothetical molecules that are produced and
structurally examined by the present invention are
represented in the same format used to represent the 
protein structures in the data base. Using this for- 
mat, these hypothetical molecules are also stored by 
the disk drives 108 and 110 for use during the struc- 
tural design process and for subsequent use after the 
process has been completed. 

A Digital Equipment Corporation VAX/VMS operating
system allows for multiple users and assures
file system integrity. It provides virtual memory,
which relieves the programmer of having to worry about
the amount of memory that is used. Initial software
was developed under versions 3.0 to 3.2 of the VAX/VMS
operating system. The serial processor mode currently
is running on version 4.4. DEC editors and a FORTRAN
compiler were utilized.

The CPU 102 is connected by Bus 106 to a
multiplexer 114. The multiplexer allows a plurality of
devices to be connected to the CPU 102 via Bus 106. A 
suitable form for multiplexer 114 is a Digital Equip- 
ment Corporation model Dz 16 terminal multiplexer. In 
the preferred embodiment, two of these multiplexers 
are used. The multiplexer 114 supports terminals (not 
shown in Figure 1) and the serial communications (at
19.2 Kbaud, for example) to the computer-graphics
display system indicated by the dash lined box 116.

The computer-graphics display system 116 includes 
an electronics stage 118. The electronic stage 118 is 
used for receiving the visual image prepared by CPU
102 and for displaying it to the user on a display
(typically one involving color) 120. The electronic 
stage 118 in connection with the associated subsystems 
of the computer-graphics display system 116 provide 
for local control of specific functions, as described 
below. A suitable form of the electronics system 118 
is a model PS 320 made by Evans & Sutherland Corp. of
Salt Lake City, Utah. A suitable form for the display 120
is either a 25 inch color monitor or a 19 inch color 
monitor from Evans & Sutherland. 

Dynamic random access memory 122 is connected to
the electronic stage 118. Memory 122 allows the elec- 
tronic system 118 to provide the local control of the 
image discussed below. In addition, a keyboard 124 of 
conventional design is connected to the electronic 
stage 118, as is an x/y tablet 126 and a plurality of 
dials 128. The keyboard 124, x/y tablet 126, and
dials 128 in the serial processor mode are also
obtained from Evans & Sutherland.

The computer generated graphics system 116, as 
discussed above, receives from CPU 102 the image to be 
displayed. It provides local control over the dis- 
played image so that specific desired user initiated 
functions can be performed, such as:

(1) zoom (so as to increase or decrease the size
of the image being displayed);

(2) clipping (where the sides, front or back of
the image being displayed are removed); 

(3) intensity depth cueing (where objects further
away from the viewer are made dimmer so as to provide 
a desired depth effect in the image being displayed); 

(4) translation of the image in any of the three 
axes of the coordinate system utilized to plot the 
molecules being displayed; 

(5) rotation in any of the three directions of 
the image being displayed; 

(6) on/off control of the logical segments of the 
picture. For example, a line connecting the alpha 
carbons of the native protein might be one logical 
segment; labels on some or all of the residues of the 
native protein might be a second logical segment; a 
trace of the alpha carbons of the linker(s) might be a



WO 88/01649 



PCT/US87/02208 



third segment; and a stick figure connecting Carbon, 
Nitrogen, Oxygen, and Sulphur atoms of the linker(s)
and adjacent residue of the native protein might be a 
fourth logical segment. The user seldom wants to see 
all of these at once; rather the operator first be- 
comes oriented by viewing the first two segments at 
low magnification. Then the labels are switched off 
and the linker carbon trace is turned on. Once the
general features of the linker are seen, the operator 
zooms to higher magnification and turns on the seg- 
ments which hold more detail; 

(7) selection of atoms in the most detailed logi- 
cal segment. Despite the power of modern graphics, 
the operator can be overwhelmed by too much detail at 
once. Thus the operator will pick one atom and ask to 
see all amino acids within some radius of that atom, 
typically 6 Angstroms, but other radii can be used.
The user may also specify that certain amino acids
will be included in addition to those that fall within
the specified radius of the selected atom; 

(8) changing of the colors of various portions of 
the image being displayed so as to indicate to the 
viewer particular information using visual cueing.
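
The selection described in function (7) amounts to a distance filter. The sketch below is an illustration only, not the FRODO-based code of the present system; the function name residues_within and the (residue identifier, coordinates) input layout are assumptions made for the example. It returns every residue having at least one atom within the chosen radius of the picked atom, plus any residues the user asks to include regardless.

```python
import math


def residues_within(atoms, center, radius=6.0, always_include=()):
    """Select residues with any atom within `radius` Angstroms of `center`.

    atoms: iterable of (residue_id, (x, y, z)) pairs, one per atom.
    always_include: residue ids the user wants shown in any case.
    """
    selected = set(always_include)
    for residue_id, xyz in atoms:
        if math.dist(xyz, center) <= radius:
            selected.add(residue_id)
    return sorted(selected)
```

The default radius of 6 Angstroms follows the typical value given above; other radii are passed through the radius argument.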

As stated above, the serial processor mode of the 
present invention currently is running the application 
software on version 4.4 of the VAX/VMS operating
system used in conjunction with CPU 102. The
application programs were programmed using the FLECS (FORTRAN
Language with Extended Control Structures) programming
language written in 1974 by Terry Beyer of the Univer- 
sity of Oregon, Eugene, Oregon. FLECS is a FORTRAN 
preprocessor, which allows more logical programming. 
All of the code used in the serial processor mode was
developed in FLECS. It can be appreciated, however,
that the present invention encompasses other operating 
systems and programming languages. 

The macromolecules displayed on color display 120 
of the computer-graphics display system 116 utilize an 
extensively modified version of version 5.6 of FRODO.
FRODO is a program for displaying and manipulating 
macromolecules. FRODO was written by T.A. Jones at 
Max Planck Institute for Biochemistry, Munich, West 
Germany, for building or modeling in protein
crystallography. FRODO version 5.6 was modified so as to be
driven by command files; programs were then written to 
create the command files. It is utilized by the elec- 
tronic stage 118 to display and manipulate images on 
the color display 120. Again, any suitable type of
program can be used for displaying and manipulating
the macromolecules, the coordinates of which are
provided to the computer-graphics display system 116 by
the CPU 102. 

Design documentation and memos were written using 
PDL (Program Design Language) from Caine, Farber &
Gordon of Pasadena, California. Again, any suitable
type of program can be used for the design documents 
and memos. 

Figure 2 shows a block diagram for an improved 
version of the hardware system of the present inven- 
tion. Like numbers refer to like items of Figure 1. 
Only the differences between the serial processor mode 
system of Figure 1 and the improved system of Figure 2 
are discussed below. 

The CPU 102' is the latest version of the VAX
11/780 from Digital Equipment Corporation. The latest 
processor from DEC in the VAX product family is ap- 
proximately ten times faster than the version shown in 
the serial processor mode of Figure 1. 

Instead of the two Rm05 disk drives 108 of Figure 
1, the embodiment of Figure 2 utilizes five RA 81 disk 
drive units 110'. This is to upgrade the present
system to more state of the art disk drive units, which
provide greater storage capability and faster access. 

Serial processor 106 is connected directly to the 
electronic stage 118' of the computer-graphics display 
system 116. The parallel interface in the embodiment 
of Figure 2 replaces the serial interface approach of 
the serial processor mode of Figure 1. This allows 
for faster interaction between CPU 102' and electronic
stage 118' so as to provide faster data display to the
expert operator. 

Disposed in front of color display 120 is a stereo
viewer 202. A suitable form for stereo viewer 202 is 
made by Terabit, Salt Lake City, Utah. Stereo viewer 
202 would provide better 3-D perception to the expert 
operator than can be obtained presently through rota- 
tion of the molecule. 

In addition, this embodiment replaces the FRODO 
macromolecule display programs with a program designed 
to show a series of related hypothetical molecules. 
This newer program performs the operations more quick- 
ly so that the related hypothetical molecules can be 
presented to the expert operator in a short enough 
time that makes examination less burdensome on the 
operator. 

The programs can be modified so as to cause the
present invention to eliminate candidates in the sec- 
ond general step where obvious rules have been vio- 
lated by the structures that are produced. For exam- 
ple, one rule could be that if an atom in a linker 
comes closer than one Angstrom to an atom in the na- 
tive structure the candidate would be automatically 
eliminated.
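
The example rule above (reject any candidate in which a linker atom comes closer than one Angstrom to a native atom) can be sketched as a simple filter. This is an illustration under stated assumptions, not the modified program itself: the names has_clash and eliminate_clashing are hypothetical, and each candidate is assumed to be a dictionary whose "atoms" entry lists the linker atom coordinates.

```python
import math


def has_clash(linker_atoms, native_atoms, min_distance=1.0):
    """True if any linker atom lies closer than min_distance (Angstroms)
    to any atom of the native structure."""
    return any(
        math.dist(a, b) < min_distance
        for a in linker_atoms
        for b in native_atoms
    )


def eliminate_clashing(candidates, native_atoms, min_distance=1.0):
    """Keep only the candidates that pass the closest-approach rule."""
    return [
        c for c in candidates
        if not has_clash(c["atoms"], native_atoms, min_distance)
    ]
```

The all-pairs comparison is the simplest correct form of the test; a spatial grid could reduce the work for large structures, at the cost of more code.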

In addition, the surface accessibility of mole- 
cules could be determined and a score based on the 
hydrophobic residues in contact with the solvent could 
be determined. After the hydrophobic residues have 
been calculated, the candidates could be ranked so 
that undesired candidates could automatically be elim- 
inated. The protein is modeled in the present inven- 
tion without any surrounding matter. Proteins almost 
always exist in aqueous solution; indeed, protein
crystals contain between 20% and 90% water and
dissolved salts which fill the space between the protein
molecules. Certain kinds of amino acids have side
chains which make favorable interactions with aqueous
solutions (serine, threonine, arginine, lysine, histi- 
dine, aspartic acid, glutamic acid, proline, aspara- 
gine, and glutamine) and are termed hydrophilic. 
Other amino acids have side chains which are apolar 
and make unfavorable interactions with water
(phenylalanine, tryptophan, leucine, isoleucine, valine,
methionine, and tyrosine) and are termed hydrophobic. In
natural proteins, hydrophilic amino acids are almost 
always found on the surface, in contact with solvent; 
hydrophobic amino acids are almost always inside the 
protein in contact with other hydrophobic amino acids. 
The remaining amino acids (alanine, glycine, and
cysteine) are found both inside proteins and on their
surfaces. The designs of the present invention should 
resemble natural proteins as much as possible, so hy- 
drophobic residues are placed inside and hydrophilic 
residues are placed outside as much as possible. 
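
A score of the kind suggested above can be sketched as follows. The three residue classes are taken directly from the lists in the text; everything else is an assumption made for illustration: the function names, the use of three-letter residue codes, and the simplification that solvent exposure is already known as a yes/no flag per residue rather than computed from coordinates.

```python
# Residue classes as listed in the text (three-letter codes).
HYDROPHILIC = {"SER", "THR", "ARG", "LYS", "HIS",
               "ASP", "GLU", "PRO", "ASN", "GLN"}
HYDROPHOBIC = {"PHE", "TRP", "LEU", "ILE", "VAL", "MET", "TYR"}
EITHER = {"ALA", "GLY", "CYS"}


def exposed_hydrophobic_score(residues):
    """Count hydrophobic residues in contact with solvent.

    residues: iterable of (three_letter_name, is_exposed) pairs.
    A lower score means the design looks more like a natural protein.
    """
    return sum(
        1 for name, is_exposed in residues
        if is_exposed and name in HYDROPHOBIC
    )


def rank_candidates(candidates):
    """Order candidates from fewest to most exposed hydrophobic residues."""
    return sorted(
        candidates,
        key=lambda c: exposed_hydrophobic_score(c["residues"]),
    )
```

A cutoff applied to the ranked list would give the automatic elimination of undesired candidates mentioned above; the choice of cutoff is left open here.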

Programs could be utilized to calculate an energy 
for each hypothetical structure. In addition,
programs could make local adjustments to the hypothetical
molecules to minimize the energy. Finally, molecular 
dynamics could be used to identify particularly un- 
stable parts of the hypothetical molecule. Although 
existing programs could calculate a nominal energy for
each hypothetical structure, it has not yet been de- 
monstrated that such calculations can differentiate 
between sequences which will fold and those that will 
not. Energy minimization could also be accomplished 
with extant programs, but energy minimization also
cannot differentiate between sequences which will fold
and those that will not. Molecular dynamics simula- 
tions currently cannot be continued long enough to 
simulate the actual folding or unfolding of a protein 
and so cannot distinguish between stable and unstable 
molecules. 

Two megabytes of storage 128' in the computer 
generated display system 116 is added so that several 
different molecules can be stored at the display 
level. These molecules then can be switched back and
forth on the color display 120 so that the expert 
operator can sequentially view them while making ex- 
pert decisions. The parallel interface that is shown 
in Figure 2 would allow the coordinates to be
transferred faster from the CPU 102' to the electronics
stage 118' of the computer generated display system 
116. 

The parallel processing architecture embodiment of 
the present invention is described below in Section V. 
This parallel architecture embodiment provides even 
faster analysis and display. 

III. Single Linker Embodiment

This first embodiment of the present invention 
determines and displays possible chemical structures 
for using a single linker to convert the naturally 
aggregated but chemically separate heavy and light 
polypeptide chains into a single polypeptide chain 
which will fold into a three dimensional structure 
very similar to the original structure made of two 
polypeptide chains. 

A. Plausible Site Selection

There are two main goals of the plausible site
selection step 302 of the present invention shown in 
very generalized block diagram form in Figure 3. The 
first goal is to select a first plausible site on the 
first chain that is the minimum distance from the sec- 
ond plausible site on the second chain. The first 
point on the first chain and the second point on the 
second chain comprise the plausible site. 

The second goal of the site selection is to select 
plausible sites that will result in the least loss of 
native protein. Native protein is the original pro- 
tein composed of the two aggregated polypeptide chains 
of the variable region. It is not chemically possible 
to convert two chains to one without altering some of 
the amino acids. Even if only one amino acid was
added between the carboxy terminal of the first domain
and the amino terminal of the second domain, the
charges normally present at these termini would be lost.
In the variable regions of antibodies, the termini of
the H and L chains are not very close together.
Hypothetical linkers which join the carboxy terminus of
one chain to the amino terminus of the other do not
resemble the natural variable region structures.
Although such structures are not impossible, it is more
reasonable to cut away small parts of the native
protein so that compact linkers which resemble the native
protein will span the gap. Many natural proteins are 
known to retain their structure when one or more resi- 
dues are removed from either end. 

In the present embodiment, only a single linker 
(amino acid sequence or bridge for bridging or linking
the two plausible sites to form a single polypeptide
chain) is used. Figure 4 shows in block diagram form
the steps used to select plausible sites in the single
linker. The steps of Figure 4 are a preferred embodi- 
ment of step 302 of Figure 3. 

A domain 1 is picked in a step 402 (see Figure 4). 
A schematic diagram of two naturally aggregated but 
chemically separate polypeptide chains is shown in 
Figure 5A. For purposes of illustration, assume that 
L is the light chain of the antibody variable region 
(the first polypeptide chain) and is domain 1. As 
shown in Figure 5A, light chain L is on the left side, 
and heavy chain H is on the right side. 

The next step 404 is to pick the domain 2, which, 
as indicated, is the heavy chain H of the antibody 
variable region on the right side of Figure 5A. 

The linker that will be selected will go from
domain 1 (the light chain L) towards domain 2 (heavy
chain, H). As the linker will become part of the
single polypeptide chain, it must have the same
directionality as the polypeptides it is linking; i.e., the
amino end of the linker must join the carboxy terminal
of some amino acid in domain 1, and the carboxy
terminal of the linker must join the amino terminal of
some residue in domain 2. A starting point (first
site) on domain 1 is selected, as represented by step
406 in Figure 4. The starting point is chosen to
be close to the C (C for carboxy) terminal of domain
1; call this amino acid tau 1. It is important to
pick tau 1 close to the C terminal to minimize loss of
native protein structure. Residue tau 1 is shown
schematically in two dimensions in figure 6A; it is
also shown in figure 6B where it is presented in a
two-dimensional representation of the naturally aggre- 
gated but chemically separate H and L polypeptide 
chains. 

Next, the final point (second site) close to the N (N
for amino) terminal of domain 2 is selected, as indi- 
cated by step 408 of Figure 4. The final site is an 
amino acid of domain 2 which will be called sigma 1. 
It is important that amino acid sigma 1 be close to 
the N terminal of domain 2 to minimize loss of native 
protein structure. Amino acid sigma 1 is shown sche- 
matically in figure 6A and in the more realistic re- 
presentation of figure 6B. 

Figure 7 shows in simplified form the concept that 
the linker goes from a first site at amino acid tau 1 
in domain 1 to a second site at amino acid sigma 1 in 
domain 2. There are a plurality of possible first
sites and a plurality of second sites, as is shown in 
figure 7. A computer program prepares a table which 
contains for each amino acid in domain 1 the identity 
of the closest amino acid in domain 2 and the dis- 
tance. This program uses the position of the alpha 
carbon as the position of the entire amino acid. The 
expert operator prepares a list of plausible amino
acids in domain 1 to be the first site, tau 1, and a
list of plausible amino acids in domain 2 to be the 
second site, sigma 1. Linkers are sought from all 
plausible sites tau 1 to all plausible sites sigma 1. 
The expert operator must exercise reasonable judgement 
in selecting the sites tau 1 and sigma 1 in deciding 
that certain amino acids are more important to the 
stability of the native protein than are other amino 
acids. Thus the operator may select sites which are
not actually the closest. 
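
The closest-residue table described above can be sketched in a few lines. As in the text, each amino acid is represented by the position of its alpha carbon. The sketch is illustrative and is not the program used by the present system; the function name closest_residue_table and the (residue identifier, alpha carbon coordinates) input layout are assumptions made for the example.

```python
import math


def closest_residue_table(domain1, domain2):
    """For each residue of domain 1, find the nearest residue of domain 2.

    domain1, domain2: lists of (residue_id, (x, y, z)) pairs, where the
    coordinates are those of the alpha carbon, in Angstroms.
    Returns a list of (domain1_id, closest_domain2_id, distance) rows.
    """
    table = []
    for id1, ca1 in domain1:
        id2, ca2 = min(domain2, key=lambda residue: math.dist(ca1, residue[1]))
        table.append((id1, id2, math.dist(ca1, ca2)))
    return table
```

The table is only an aid: as noted above, the expert operator may still prefer sites that are not the closest pair when other residues matter more to the stability of the native protein.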

The complete designed protein molecule in accor- 
dance with the present invention consists of the dom- 
ain 1 (of the light chain L) up to the amino acid tau 
1, the linker, as shown by the directional line in
Figure 8A and in Figure 8B, and the domain 2 from ami- 
no acid sigma 1 to the C terminus of the heavy chain, 
H. As shown in Figures 8A and 8B, in the representa- 
tive example, this results in the following loss of 
native protein. 

The first loss in native protein is from the resi- 
due after residue tau 1 to the C terminus of domain 1 
(light chain L). The second loss of native protein is 
from the N terminus of domain 2 (heavy chain, H) to 
the amino acid before sigma 1. 
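
The composition of the designed chain, and the two losses of native protein, can be made concrete with a small sketch. It is an illustration only: the function name build_single_chain is hypothetical, sequences are written as one-letter-code strings, tau 1 and sigma 1 are given as 1-based residue positions, and the sequences in the usage lines are toy values rather than real antibody sequences.

```python
def build_single_chain(light, heavy, tau1, sigma1, linker):
    """Join domain 1 (through tau 1), the linker, and domain 2 (from sigma 1).

    tau1:   1-based position in `light` of the last residue kept.
    sigma1: 1-based position in `heavy` of the first residue kept.
    Returns (designed_sequence, (lost_from_light, lost_from_heavy)).
    """
    kept_light = light[:tau1]          # N terminus of domain 1 up to tau 1
    kept_heavy = heavy[sigma1 - 1:]    # sigma 1 to the C terminus of domain 2
    lost_light = light[tau1:]          # residues after tau 1
    lost_heavy = heavy[:sigma1 - 1]    # residues before sigma 1
    return kept_light + linker + kept_heavy, (lost_light, lost_heavy)


# Toy sequences, for illustration only.
chain, lost = build_single_chain("DIQMTQS", "EVQLVES", tau1=5, sigma1=3, linker="GG")
print(chain)
print(lost)
```

Choosing tau 1 nearer the C terminus of domain 1 and sigma 1 nearer the N terminus of domain 2 shortens both lost segments, which is the second goal of site selection stated above.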

As is best understood from Figure 8A, the
introduction of linker 1 produces a single polypeptide
chain from the two naturally aggregated chains. The
polypeptide chain begins with the N terminal of domain
1. Referring now to Figure 8B, the chain proceeds
through almost the entire course of the native light
chain, L, until it reaches amino acid tau 1. The
linker then connects the carboxy terminal of a very
slightly truncated domain 1 to residue sigma 1 in the
very slightly truncated domain 2. Since a minimum
amount of native protein is eliminated, and the linker 
is selected to fit structurally as well as possible 
(as described below in connection with general steps 2 
and 3 of the present invention), the resulting single 
polypeptide chain has a very high probability (several 
orders of magnitude greater than if the linker was 
selected randomly) to fold into a three-dimensional
structure very similar to the original structure made 
of two polypeptide chains. 

The single polypeptide chain results in a much 
more stable protein which contains a binding site 
very similar to the binding site of the original an- 
tibody. In this way a single polypeptide chain can be 
engineered from the naturally occurring
two-polypeptide chain variable region, so as to create a
polypeptide of only one chain, but maintaining the binding
site of the antibody. 

In the current mode of the present invention, the
expert operator selects the sites with minimal help 
from the computer. The computer prepares the table of 
closest-residue-in-other-domain. The computer can 
provide more help in the following ways. 

(1) Prepare a list of conserved and variable
residues for the variable region of antibodies (Fv region).
Residues which vary from Fv to Fv would be much better
starting or ending sites for linkage than are residues
which are conserved over many different Fv sequences.

(2) Prepare a list of solvent accessibilities.
Amino acids exposed to solvent can be substituted with
less likelihood of destabilizing the native structure
than amino acids buried within the native structure.
Exposed amino acids are better choices to start or end
linkage.

With respect to each of the plurality of possible
first sites (on domain 1 or light chain L) there are
available a plurality of second sites (on domain 2 or
heavy chain H) (see Figures 7 and 8A). As the second
site is selected closer to the N terminus of domain 2,
the distance to any of the plausible first sites
increases. Also, as the first site is selected closer
to the C terminus of domain 1, the distance to any of
the plausible second sites increases. It is this
tension between shortness of linker and retention of
native protein which the expert operator resolves in
choosing gaps to be linked. The penalties for including
extra sites in the list of gaps are:

(1) searching in general step 2 will be slower, and

(2) more candidates will pass from step 2, many of
which must be rejected in step 3. As step 3 is
currently a manual step, this is the more serious
penalty.

Figure 8B shows diagrammatically by a directional arrow
the possible links that can occur between the various
sites near the C terminal of domain 1 and the various
sites near the N terminal of domain 2. 

B. Selection of Candidates 
In the second of the three general steps of the 
present invention as used in the single linker embodi- 
ment, plausible candidates for linking the site 1 on 
domain 1 with site 2 on domain 2 are selected from a 
much larger group of candidates. This process of
winnowing out candidates results in the expert operator
and/or expert system having a relatively small group 
of candidates to rank from most plausible to least
plausible in the third general step of the present
invention, as described in subsection C below. 

Currently, there are approximately 250 protein 
structures, determined at 2.0 Å or higher resolution,
in the public domain. The structures of these very 
complicated molecules are determined using
sophisticated scientific techniques such as X-ray
crystallography, neutron diffraction, and nuclear magnetic
resonance. Structure determination produces a file of
data for each protein. The Brookhaven Protein Data
Bank (BPDB) exemplifies a repository of protein
structural information. Each file in BPDB contains many
records of different types. These records carry the 
following information: 

(1) Name of the protein and standard classifica- 
tion number, 

(2) Organism from which protein was obtained, 

(3) Name and address of contributor, 

(4) Amino-acid sequence of each polypeptide chain, 
if known, 

(5) Connectivity of disulfides, if any,

(6) Names and connectivities of any prosthetic
groups, if any,

(7) References to literature, 

(8) Transformation from reported coordinates to 
crystallographic coordinates, 

(9) Coordinates of each atom determined. 

There is at least one record for each atom for 
which a coordinate was determined. Some parts of some
proteins are disordered and do not diffract X-rays, so 
no sensible coordinates can be given. Thus there may 
be amino acids in the sequence for which only some or 
none of the atoms have coordinates. Coordinates are 
given in Angstrom units (100,000,000 Å = 1 cm) on a
rectangular Cartesian grid. As some parts of a pro- 
tein may adopt more than one spatial configuration, 
there may be two or more coordinates for some atoms.
In such cases, fractional occupancies are given for
each alternative position. Atoms move about, some
more freely than others. X-ray data can give an
estimate of atomic motion which is reported as a
temperature (a.k.a. Debye-Waller) factor.

Any other data base which included, implicitly or 
explicitly, the following data would be equally use- 
ful: 

(1) Amino acid sequence of each polypeptide chain,

(2) Connectivity of disulfides, if any,

(3) Names and connectivities of any prosthetic
groups, if any,

(4) Coordinates (x, y, z) of each atom in each
observed configuration,

(5) Fractional occupancy of each atom, 

(6) Temperature factor of each atom. 
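
The minimal data set listed above can be pictured as a simple record structure. The following Python sketch is illustrative only: the class and field names (`AtomRecord`, `ProteinEntry`, `alternate_positions`) are assumptions made for this example and do not reproduce the actual BPDB record layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AtomRecord:
    name: str                          # atom name, e.g. "CA" for an alpha carbon
    xyz: Tuple[float, float, float]    # Cartesian coordinates in Angstroms
    occupancy: float = 1.0             # fractional occupancy of this position
    temperature_factor: float = 0.0    # Debye-Waller factor


@dataclass
class ProteinEntry:
    sequences: List[str]                                   # one sequence per chain
    disulfides: List[Tuple[int, int]] = field(default_factory=list)
    prosthetic_groups: List[str] = field(default_factory=list)
    atoms: List[AtomRecord] = field(default_factory=list)

    def alternate_positions(self, name: str) -> List[AtomRecord]:
        """Return records of the named atom that are only partially occupied."""
        return [a for a in self.atoms if a.name == name and a.occupancy < 1.0]
```

An atom observed in two configurations would simply appear as two `AtomRecord` entries whose occupancies sum to one.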






Proteins usually exist in aqueous solution. Although
protein coordinates are almost always determined for
proteins in crystals, direct contacts between proteins
are quite rare. Protein crystals contain from 20% to 90%
water by volume. Thus one usually assumes that the
structure of the protein in solution will be the same as
that in the crystal. It is now generally accepted that
the solution structure of a protein will differ from the
crystal structure only in minor details. Thus, given the
coordinates of the atoms, one can calculate quite easily
the solvent accessibility of each atom.

In addition, the coordinates implicitly give the 
charge distribution throughout the protein. This is 
of use in estimating whether a hypothetical molecule 
(made of native protein and one or more linkers) will 
fold as designed. The typical protein whose structure
is known comprises a chain of amino acids (there are
21 types of amino acids) in the range of 100 to 300
amino acids.

Each of these amino acids alone or in combination
with the other amino acids as found in the known pro- 
tein molecule can be used as a fragment to bridge the 
two sites. The reason that known protein molecules 
are used is to be able to use known protein fragments 
for the linker or bridge. 

Even with only 250 proteins of known structure,
the number of possible known fragments is very large.
A linker can be from one to twenty or thirty amino
acids long. Let "Lmax" be the maximum number of amino
acids allowed in a linker; for example, Lmax might be
25. Consider a protein of "Naa" amino acids. Proteins
have Naa in the range 100 to 800; 250 is typical. From
this protein one can select Naa-1 distinct
two-amino-acid linkers, Naa-2 distinct three-amino-acid
linkers, ... and (Naa+1-Lmax) distinct linkers
containing exactly Lmax amino acids. The total number
of linkers containing Lmax or fewer amino acids is
"Nlink,"



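\[
\mathrm{Nlink} = \sum_{j=1}^{\mathrm{Lmax}} (\mathrm{Naa} + 1 - j)
= \mathrm{Naa} \times \mathrm{Lmax} - \frac{\mathrm{Lmax} \times \mathrm{Lmax}}{2} + \frac{\mathrm{Lmax}}{2}
\]
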
If Naa is 250 and Lmax is 25, Nlink will be 5975. If 
the number of known proteins is "Nprot," then the 
total number of linkers, "Nlink_total" will be 





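\[
\begin{aligned}
\mathrm{Nlink\_total} &= \sum_{k=1}^{\mathrm{Nprot}} \sum_{j=1}^{\mathrm{Lmax}} (\mathrm{Naa}(k) + 1 - j) \\
&= \sum_{k=1}^{\mathrm{Nprot}} \left[ \mathrm{Naa}(k) \times \mathrm{Lmax} - \frac{\mathrm{Lmax} \times \mathrm{Lmax}}{2} + \frac{\mathrm{Lmax}}{2} \right] \\
&= \mathrm{Nprot} \times \frac{\mathrm{Lmax} - \mathrm{Lmax} \times \mathrm{Lmax}}{2} + \mathrm{Lmax} \times \sum_{k=1}^{\mathrm{Nprot}} \mathrm{Naa}(k)
\end{aligned}
\]

where Naa(k) is the number of amino acids in the kth
protein. With 250 proteins, each containing 250 amino
acids (on average), and Lmax set to 25, Nlink_total is
1,425,000.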
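
The counting argument above can be checked with a few lines of Python. This is a minimal sketch; the function names `nlink` and `nlink_total` are chosen here for illustration, and the formula assumes every chain is at least Lmax residues long. Evaluated exactly, the closed form gives 5,950 linkers for Naa = 250 and Lmax = 25, and 1,487,500 for 250 such proteins, so the figures quoted in the text are best read as approximate.

```python
def nlink(naa: int, lmax: int) -> int:
    """Closed form of sum_{j=1..lmax} (naa + 1 - j)."""
    # lmax*lmax - lmax is always even, so integer division is exact.
    return naa * lmax - (lmax * lmax - lmax) // 2


def nlink_total(chain_lengths, lmax: int) -> int:
    """Total number of known-structure linkers over all proteins."""
    return sum(nlink(naa, lmax) for naa in chain_lengths)


per_protein = nlink(250, 25)
whole_database = nlink_total([250] * 250, 25)
print(per_protein, whole_database)
```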

This is the number of linkers of known structure.
If one considers the number of possible amino acid
sequences up to length Lmax (call it "Nlink_possible"),
it is much larger.

\[
\mathrm{Nlink\_possible} = \sum_{j=1}^{\mathrm{Lmax}} 20^{j}
\]

For Lmax = 25

Nlink_possible = 353,204,547,368,421,052,631,578,947,368,420

≈ 3.53 × 10^32

Using known peptide fragments thus reduces the
possibilities by twenty-six orders of magnitude.
Appropriate searching through the known peptide
fragments reduces the possibilities a further five
orders of magnitude.
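
The size of the full sequence space, and the reduction obtained by restricting the search to known fragments, can be reproduced as follows. This is a sketch only; `nlink_possible` is an illustrative name, and the value of 1,425,000 known linkers is the figure quoted in the text.

```python
import math


def nlink_possible(lmax: int, alphabet: int = 20) -> int:
    """Number of amino acid sequences of length 1..lmax."""
    return sum(alphabet ** j for j in range(1, lmax + 1))


possible = nlink_possible(25)
known = 1_425_000                      # known-structure linkers quoted in the text
orders_of_magnitude = math.log10(possible / known)
print(possible, round(orders_of_magnitude))
```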

Essentially, the present invention utilizes a
selection strategy for reducing a list of possible
candidates. This is done, as explained below, in a
preferred form in a three step process. This three step
process, as is illustrated in the explanation of each
of the three steps of the process, significantly
reduces the computer time required to extract the most
promising candidates from the data base of possible
candidates. This should be contrasted with a serial
search throughout the entire data base of candidates,
which would require all candidates to be examined in
total. The present invention examines certain specific
parameters of each candidate, and uses these parameters
to produce subgroups of candidates that are then
examined by using other parameters. In this way, the
computer processing speed is significantly increased.

The best mode of the present invention uses a protein
data base created and supplemented by the Brookhaven
National Laboratory in Upton, Long Island, New York.
This data base is called the Brookhaven Protein Data
Base (BPDB). It provides the physical and chemical
parameters that are needed by the present invention.
It should be understood that the candidate linkers can
be taken from the Brookhaven Protein Data Base or any
other source of three-dimensional protein structures.
These sources must accurately represent the proteins.
In the current embodiments, X-ray structures determined
at resolution of 2.5 Å or higher and appropriately
refined were used. Each peptide is replaced (by
least-squares fit) by a standard planar peptide with
standard bond lengths and angles. Peptides which do not
accurately match a standard peptide (e.g. cis peptides)
are not used to begin or end linkers, but may appear in
the middle.

Each sequence up to some maximum number of amino 
acids (Lmax) is taken as a candidate. In the prefer- 
red embodiment, the maximum number of amino acids 
(Lmax) is set to 30. However, the present invention 
is not limited to this number, but can use any maximum 
number that is desired under the protein engineering 
circumstances involved.

1. Selecting Candidates with Proper Distance
Between the N Terminal and the C Terminal.

The first step in the selection of candidates
is to select the candidate linkers with a proper
distance between the N terminal and the C terminal from
all of the candidate linkers that exist in the protein
data base that is being used. Figure 9 shows in block
diagram form the steps that make up this candidate
selection process utilizing distance as the selection
parameter.

Referring to Figure 9, a standard point relative 
to the peptide unit at the first site is selected, as 
shown by block 902. 

A standard point relative to the peptide unit in
the second site is also picked, as indicated by a
block 904. Note that in the best mode the geometric
centers of the peptide units of the first and second
sites are used, but any other standard point can be
utilized, if desired.

The distance between the standard points of the
two peptides at the first and second sites defining
the gap to be bridged by the linker is then calculated,
as indicated by block 906. This scalar distance
value is called the span of the gap. Note that this
scalar value does not include any directional
information.

Next, as indicated by a step 908, the distances
between the ends of the possible linker candidates are
calculated. The distance between the ends of a
particular candidate is called the span of the
candidate. Note that each possible linker candidate has
its own scalar span of the candidate value.

The final step in the distance selection candidate
selection process is that of a step 910. In step 910,
candidates are discarded whose span of the candidate
values differ from the span of the gap value by more
than a preselected amount (this preselected amount is
Max LSQFIT error). In the best mode of the present
invention, the preselected amount for Max LSQFIT error
is 0.50 Angstroms. However, any other suitable value
can be used.

The preceding discussion has been for a single
gap. In fact, the expert user often selects several
gaps and the search uses all of them. The span of
each candidate is compared to the span of each gap 
until it matches one, within the preset tolerance, or 
the list of gaps is exhausted. If the candidate mat- 
ches none of the gaps, it is discarded. If it matches 
any gap it is carried to the next stage. 
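
The distance test of steps 902 to 910, extended to several gaps, can be sketched as below. This is a minimal illustration, not the inventors' program: the function names and the tuple layout of a candidate are assumptions, and the standard points are taken as already computed.

```python
import math


def span(first_point, last_point) -> float:
    """Scalar distance between the standard points at the two ends."""
    return math.dist(first_point, last_point)


def distance_filter(candidates, gap_spans, tolerance=0.5):
    """Keep candidates whose span matches any gap span within the tolerance.

    candidates: iterable of (label, n_terminal_point, c_terminal_point).
    gap_spans:  list of gap spans in Angstroms.
    """
    first_group = []
    for label, start, end in candidates:
        s = span(start, end)
        if any(abs(s - g) <= tolerance for g in gap_spans):
            first_group.append(label)      # carried to the next stage
    return first_group


candidates = [
    ("a", (0.0, 0.0, 0.0), (12.5, 0.0, 0.0)),
    ("b", (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)),
    ("c", (0.0, 0.0, 0.0), (0.0, 8.8, 0.0)),
]
first_group = distance_filter(candidates, gap_spans=[12.7, 8.5])
```

Because only one subtraction and one comparison are needed per candidate and gap, this test is the cheapest of the three and is applied first.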

The inventors have determined that the use of the
distance as the first parameter for discarding possible
linker candidates results in a significant reduction in
the number of possible candidates with a minimum amount
of computer time. In terms of the amount of reduction,
a representative example (using linkers up to 20 amino
acids) starts out with 761,905 possible candidates that
are in the protein data base. This selection of
candidates using the proper distance parameter winnows
this number down to approximately 63,727 possible
candidates. As is discussed below, the distance
selection operation requires much less computer time
than is required by the other two steps which make up
this selection step 304.

The result of this selection of candidates according
to proper distance is a group (called a first group of
candidates) which exhibit a proper length as compared
to the gap that is to be bridged or linked. This first
group of candidates is derived from the protein data
base using the distance criterion only.

2. Selecting Candidates with Proper Direction from N
Terminal to C Terminal

This substep essentially creates a second group of
possible candidates from the first group of possible
candidates which was produced by the distance selection
substep discussed in connection with Figure 9. The
second group of candidates is selected in accordance
with the orientation of the C terminal residue (i.e.
the final residue) of the linker with respect to the N
terminal residue (i.e. the initial residue), which is
compared to the orientation of the C terminal residue
(i.e. the second site) of the gap with respect to the N
terminal residue (i.e. the first site). See Figure 20B.
In this way, this direction evaluation determines if
the chain of the linker ends near the second site of
the gap, when the amino terminal amino acid of the
linker is superimposed on the first site of the gap so
as to produce the minimum amount of unwanted molecular
distortion.

Referring now to Figure 10, the first step used in
producing the second group of possible candidates is a
step 1002. In step 1002 a local coordinate system is
established on the N terminal residue of one of the
selected gaps. For example, one might take the local
X-axis as running from the first alpha carbon of the N
terminal residue to the second alpha carbon of the N
terminal residue, with the first alpha carbon at the
origin and the second alpha carbon on the plus X-axis.
The local Y-axis is selected so that the carbonyl
oxygen lies in the xy plane with a positive y
coordinate. The local Z-axis is generated by crossing X
into Y. Next, as indicated by step 1004, a standard
reference point in the C terminal residue of the gap is
located and its spherical polar coordinates are
calculated in the local system. The standard reference
point could be any of the atoms in the C terminal
peptide (throughout this application, peptide, residue,
and amino acid are used interchangeably) or an average
of their positions. Steps 1002 and 1004 are repeated
for all gaps in the list of gaps.

As indicated by step 1006, a local coordinate system is
established on the N terminal residue of one of the
candidates. This local coordinate system must be
established in the same manner used for the local
coordinate systems established on each of the gaps.
Various local systems could be used, but one must use
the same definition throughout. In step 1008, the
standard reference point is found in the C terminal
residue of the current candidate. This standard point
must be chosen in the same manner used for the gaps.
The spherical polar coordinates of the standard point
are calculated in the local system of the candidate.
(This use of local coordinate systems is completely
equivalent to rotating and translating all gaps and all
candidates so that their initial peptide lies in a
standard position at the origin.)

In step 1010, the spherical polar coordinates of the
gap vector (r, theta, phi) are compared to the
spherical polar coordinates of the candidate vector
(r, theta, phi). In step 1012 a preset threshold is
applied; if the two vectors agree closely enough, then
one proceeds to step 1014 and enrolls the candidate in
the second group of candidates. Currently, this preset
threshold is set to 0.5 Å, but other values could be
used. From step 1014, one skips forward to step 1022,
vide infra. On the other hand, if the vectors compared
in step 1012 are not close enough, one moves to the
next gap vector in the list, in step 1016. If there are
no more gaps, one goes to step 1018 where the candidate
is rejected. If there are more gaps, step 1020
increments the gap counter and one returns to step
1010. From steps 1014 or 1018 one comes to step 1022
where one tests to see if all candidates have been
examined. If not, step 1024 increments the candidate
counter and one returns to step 1006. If all candidates
have been examined, one has finished, step 1026.
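
The construction of the local coordinate system and the comparison of steps 1002 to 1012 can be sketched as follows. This is an illustration under stated assumptions: all function names are hypothetical, and the closeness test is interpreted here as the distance between the two end points expressed in their respective local frames, which is one plausible reading of a 0.5 Å threshold applied to (r, theta, phi).

```python
import math


def sub(a, b):
    return [a[i] - b[i] for i in range(3)]


def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def unit(a):
    n = math.sqrt(dot(a, a))
    return [c / n for c in a]


def local_frame(origin, on_x_axis, in_xy_plane):
    """X points at on_x_axis; in_xy_plane gets a positive y; Z = X cross Y."""
    x = unit(sub(on_x_axis, origin))
    v = sub(in_xy_plane, origin)
    y = unit(sub(v, [dot(v, x) * c for c in x]))   # remove the X component
    return x, y, cross(x, y)


def spherical(point, origin, frame):
    """Spherical polar coordinates (r, theta, phi) of point in the local frame."""
    d = sub(point, origin)
    lx, ly, lz = (dot(d, axis) for axis in frame)
    r = math.sqrt(lx * lx + ly * ly + lz * lz)
    theta = math.acos(max(-1.0, min(1.0, lz / r))) if r else 0.0
    return r, theta, math.atan2(ly, lx)


def directions_agree(gap, cand, threshold=0.5):
    """Assumed test: end points within threshold Angstroms in the local frame."""
    def cart(r, theta, phi):
        return [r * math.sin(theta) * math.cos(phi),
                r * math.sin(theta) * math.sin(phi),
                r * math.cos(theta)]
    return math.dist(cart(*gap), cart(*cand)) <= threshold
```

Here `origin`, `on_x_axis`, and `in_xy_plane` would be the first alpha carbon, the second alpha carbon, and the carbonyl oxygen, respectively. Since both gap and candidate are expressed in frames built the same way, no explicit rotation or translation of either is needed.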

Figure 11 shows the concept of comparing the
direction of the gap to the direction of the candidate.

The inventors have determined that in the example
discussed above where 761,905 possible candidates are
in the protein data base, the winnowing process in
this step reduces the approximately 63,727 candidates
in the first group to approximately 50 candidates in
the second group. The inventors have also determined
that, as referenced to the units of computer time
referred to above in connection with the scalar
distance parameter, it takes approximately 4 to 5
computer units of time to perform the selection of this
step. Thus, it can be appreciated that it conserves
computer time to perform the distance selection first
and the direction selection second, since the direction
selection process takes more time than the distance
selection process.

3. Selecting Candidates with Proper Orientation
at Both Termini

In this step, the candidates in the second group
of step 1014 of Figure 10 are winnowed down to produce
a third group of plausible candidates using an
evaluation of the relative orientation between the
peptide groups at either end of the candidate, compared
to the relative orientation between the peptide groups
at either end of the gap. In a step 1201 (Figure 12),
decide that a peptide will be represented by 3, 4, or
5 atoms (vide infra). Specifically, in a step 1202,
one of the candidates in the second group (step 1014)
is selected for testing. In a step 1204, three to
five atoms in the first peptide are selected to define
the orientation of the first peptide. So long as the
atoms are not collinear, three atoms are enough, but
using four or five atoms makes the least-squares
procedure which follows over-determined and therefore
compensates for errors in the coordinates. For example,
assume selection of four atoms: C alpha, C, N, and
C beta. Next, in a step 1206, one selects the
corresponding 3, 4, or 5 atoms from the final peptide
of the selected candidate. These 6, 8, or 10 atoms
define a three-dimensional object. In a step 1208,
select one of the gaps. Select the corresponding 6,
8, or 10 atoms from the gap. In a step 1210,
least-squares fit the atoms from the candidate to the
atoms from the gap. This least-squares fit allows six
degrees of freedom to superimpose the two
three-dimensional objects. Assume that one object is
fixed and the other is free to move. Three degrees of
freedom control the movement of the center of the free
object. Three other degrees of freedom control the
orientation of the free object. In a step 1212, the
result of the least-squares fit is examined. If the
Root-Mean-Square (RMS) error is less than some preset
threshold, the candidate is a good fit for the gap
being considered and is enrolled in the third group in
a step 1214. If, on the other hand, the RMS error is
greater than the preset threshold, one checks to see if
there is another gap in the list in a step 1216. If
there is, one selects the next gap and returns to step
1208. If there are no more gaps in the list, then the
current candidate from the second group is rejected in
step 1218. In step 1220, one checks to see if there
are more candidates in the second group; if so, a new
candidate is selected and one returns to step 1201.
If there are no more candidates, one is finished (step
1222).

Again referring to a representative case, where linkers
of length up to twenty amino acids were sought for a
single gap with separation 12.7 Å, the protein data
bank contained 761,905 potential linkers. Of these,
63,727 passed the distance test. The direction test
removed all but 50 candidates. The orientation test
passed only 1 candidate with RMS error less than or
equal to 0.5 Å. There were two additional candidates
with RMS error between 0.5 Å and 0.6 Å.

Moreover, the inventors have determined that it takes
about 25 units of computer time to evaluate each
candidate in group 2 to decide whether it should be
selected for group 3. It can be appreciated now that
the order selected by the inventors for the three
steps of winnowing the candidates has been selected so
that the early steps take less time per candidate than
the following steps. The order of the steps used to
select the candidate can be changed, however, and
still produce the desired winnowing process. Logically,
one might even omit steps one and two and pass all
candidates through the least-squares process depicted
in Figure 12 and achieve the same list of candidates,
but at greater cost in computing. This may be done in
the case of parallel processing where computer time is
plentiful, but memory is in short supply.
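
The benefit of running the cheap tests first can be estimated from the figures of the representative case. The sketch below is a rough illustration: it assumes that the distance test costs 1 unit per candidate (the reference unit is not restated in this passage) and takes the upper figure of 5 units for the direction test; `staged_cost` is an illustrative name.

```python
def staged_cost(entering_counts, unit_costs):
    """Total cost when each stage only sees survivors of the previous one."""
    return sum(n * c for n, c in zip(entering_counts, unit_costs))


entering = [761_905, 63_727, 50]   # candidates entering each of the three tests
costs = [1, 5, 25]                 # ASSUMED cost units per candidate per test

staged = staged_cost(entering, costs)
brute_force = entering[0] * costs[2]   # least-squares fit on every candidate
print(staged, brute_force, round(brute_force / staged, 1))
```

Under these assumptions the staged search is more than an order of magnitude cheaper than applying the least-squares fit to every candidate in the data base.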

Another approach (not illustrated) for determining
whether the proper orientation exists between the ends
of the candidate is to examine only the atoms at the
C terminal of the candidate as compared to the atoms
at the final peptide of the gap. In step 2, the
inventors aligned the first peptide of the candidate
with the first peptide in the gap. Having done this,
one could merely compare the atoms at the C terminal
of the candidate with the atoms of the second peptide
of the gap. This approach is inferior to that discussed
above because all the error appears at the C terminus,
while the least-squares method discussed above
distributes the errors evenly.
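
The RMS error after an optimal six-degree-of-freedom superposition (step 1210) can be computed in several ways; the text does not say which algorithm was used. The sketch below substitutes Horn's quaternion formulation, with a simple power iteration to find the largest eigenvalue, purely as an illustration. The function names are hypothetical, and the two atom lists are assumed to be in corresponding order.

```python
import math


def centered(points):
    n = len(points)
    c = [sum(p[a] for p in points) / n for a in range(3)]
    return [[p[a] - c[a] for a in range(3)] for p in points]


def superposed_rms(candidate_atoms, gap_atoms, iterations=500):
    """Minimum RMS distance between matched atoms over all rigid motions."""
    P, Q = centered(candidate_atoms), centered(gap_atoms)   # removes translation
    n = len(P)
    # Correlation sums S[a][b] = sum_i P_i[a] * Q_i[b]
    S = [[sum(p[a] * q[b] for p, q in zip(P, Q)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    # Horn's symmetric 4x4 matrix; its largest eigenvalue fixes the best rotation.
    N = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    g = 0.5 * (sum(x * x for p in P for x in p) + sum(x * x for q in Q for x in q))
    if g == 0.0:
        return 0.0
    v = [1.0, 0.5, 0.25, 0.125]
    for _ in range(iterations):
        # Shift by g so that all eigenvalues are non-negative.
        w = [sum(N[i][j] * v[j] for j in range(4)) + g * v[i] for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    vv = sum(x * x for x in v)
    lam = sum(v[i] * N[i][j] * v[j] for i in range(4) for j in range(4)) / vv
    return math.sqrt(max(0.0, 2.0 * (g - lam) / n))
```

A candidate would be enrolled in the third group when `superposed_rms` of its 6, 8, or 10 terminal atoms against those of some gap falls below the preset threshold.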

C. Ranking and Eliminating Candidates.

As shown in Figure 3, the third general step in 
the present invention is that of ranking the plausible 
candidates from most plausible to least plausible, and 
eliminating those candidates that do not appear to be 
plausible based on criteria utilized by an expert 
operator and/or expert system.

In the best mode, the candidates in the third
group (step 1214) are provided to the expert operator,
who can sequentially display them in three dimensions
utilizing the computer-graphics display system 116.
The expert operator then can make decisions about the
candidates based on knowledge concerning protein
chemistry and the physical relationship of the
plausible candidate with respect to the gap being
bridged. This analysis can be used to rank the
plausible candidates in the third group from most
plausible to least plausible. Based on these rankings,
the most plausible candidates can be selected for
genetic engineering.

As noted above in connection with the illustrative
example, there are typically few (under 100) candidates
which make it to the third group of step 1214.
Consequently, a moderately expert operator (one having
a Bachelor of Science degree in chemistry, for example)
can typically winnow down this number of plausible
candidates to a group of 10 to 15. Thereafter, a more
expert operator and/or expert system can further winnow
down the number. In this way, only a very few of the
plausible candidates need to be tested in practice, as
compared to the hundreds, thousands or more of
candidates that would have to be tested if no selection
process like that of the present invention were used.
This speeds up the process of engineering the single
chain molecules by orders of magnitude, while reducing
costs and other detriments by orders of magnitude as
well.

In certain situations, however, automatic ranking
in this third general step may be warranted. This
could occur, for example, where the expert operator
was presented with quite a few candidates in the third
group or where it is desired to assist the expert
operator in making the ranking selections and
eliminating candidates based on prior experience that
has been derived from previous engineering activities
and/or actual genetic engineering experiments.

Referring now to Figure 13, a coordinate listing
of the hypothetical molecule (candidate) is
automatically constructed, as is indicated by a block
1302. The expert operator can then display using a
first color the residues from domain 1 of the native
protein. Color display 120 can provide a visual
indication to the expert operator of where the residues
lie in domain 1. This is indicated by a block 1304.

The expert operator then can display on color display
120 the residues from domain 2 of the native protein
using a second color, as is indicated by a block
1306. The use of a second color provides a visual
indication to the user which assists in distinguishing
the residues from domain 1 from the residues from
domain 2.

The linker (candidate) being ranked can be dis- 
played in a selected color, which color can be differ- 
ent from the first color of step 1304 and/or the sec- 
ond color from step 1306. Again, by using this visual 
color indication, the expert operator can distinguish 
the residues of domain 1 and 2 of the native protein. 
This display of the linker candidate is indicated by a
block 1308.

The initial picture on the color display 120 provided
to the expert operator typically shows the alpha
carbons for all of the residues. This is indicated by
a block 1310. In addition, the initial picture shows
the main-chain and side-chains for residues and linkers
and one residue before the linker and one residue
after the linker. This is indicated by a block 1312.

The expert operator can also cause any of the
other atoms in the native protein or linker candidate
to be drawn at will. The molecule can be rotated,
translated, and enlarged or reduced, by operator
command, as was discussed generally in connection with
the computer-graphics display system 116 above. The
block diagram of Figure 13 indicates that each of the
steps just discussed is accomplished in serial fashion.
However, this is only for purposes of illustration. It
should be understood that the operator can accomplish
any one or more of these steps as well as other steps
at will and in any sequence that is desired in
connection with the ranking of the plausible
candidates in group 3.

The expert operator and/or expert system utilized
in this third general step in ranking the candidates
from most plausible to least plausible and in
eliminating the remaining candidates from group 3, can
use a number of different rules or guidelines in this
selection process. Representative of these rules and
guidelines are the following, which are discussed in
connection with Figure 14. Note that the blocks in
Figure 14 show the various rules and/or criteria, which
are not necessarily utilized in the order in which the
boxes appear. The order shown is only for purposes of
illustration. Other rules and/or criteria can be
utilized in the ranking process, as well.

As shown in step 1402, a candidate can be rejected
if any atom of the linker comes closer than a minimum
allowed separation to any retained atom of the native
protein structure. In the best mode, the minimum
allowed separation is set at 2.0 Angstroms. Note that
any other value can be selected. This step can be
automated, if desired, so that the expert operator
does not have to manually perform this elimination
process.
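
An automated form of the rejection rule of step 1402 could look like the following. This is a minimal sketch with an illustrative function name; a production version would likely use a spatial grid rather than comparing every pair of atoms.

```python
import math


def has_steric_clash(linker_atoms, retained_atoms, min_separation=2.0):
    """True if any linker atom lies closer than min_separation (Angstroms)
    to any retained atom of the native protein."""
    return any(
        math.dist(a, b) < min_separation
        for a in linker_atoms
        for b in retained_atoms
    )


linker = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
native = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
rejected = has_steric_clash(linker, native)
```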

A candidate can be penalized if the hydrophobic
residues have high exposure to solvent, as is indicated
by a block 1404. The side chains of phenylalanine,
tryptophan, tyrosine, leucine, isoleucine, methionine,
and valine do not interact favorably with water and
are called hydrophobic. Proteins normally exist in
saline aqueous solution; the solvent consists of polar
molecules (H2O) and ions.

A candidate can be penalized when the hydrophilic
residues have low exposure to solvent. The side
chains of serine, threonine, aspartic acid, glutamic
acid, asparagine, glutamine, lysine, arginine, and
proline do interact favorably with water and are
called hydrophilic. This penalization step for
hydrophilic residues is indicated by a block 1406.

A candidate can be promoted when hydrophobic resi- 
dues have low exposure to solvent, as is indicated by 
a block 1408. 

A candidate can be promoted when hydrophilic resi- 
dues have high exposure to solvent, as indicated by a 
block 1410. 
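
The four exposure rules of blocks 1404 to 1410 can be combined into a single score for automatic ranking. The sketch below is hypothetical: the residue classes follow the lists given in the text, but the unit weights, the 0.5 exposure cutoff, and the name `exposure_score` are assumptions chosen for illustration.

```python
HYDROPHOBIC = {"PHE", "TRP", "TYR", "LEU", "ILE", "MET", "VAL"}
HYDROPHILIC = {"SER", "THR", "ASP", "GLU", "ASN", "GLN", "LYS", "ARG", "PRO"}


def exposure_score(residues, exposed_cutoff=0.5):
    """Score a linker from (three_letter_code, fractional_exposure) pairs.

    +1 promotes (buried hydrophobic or exposed hydrophilic residue),
    -1 penalizes (exposed hydrophobic or buried hydrophilic residue).
    Residues in neither class are ignored.
    """
    score = 0
    for code, exposure in residues:
        exposed = exposure >= exposed_cutoff
        if code in HYDROPHOBIC:
            score += -1 if exposed else 1
        elif code in HYDROPHILIC:
            score += 1 if exposed else -1
    return score


linker = [("LEU", 0.9), ("SER", 0.8), ("VAL", 0.1), ("GLY", 0.5)]
score = exposure_score(linker)
```

Candidates could then be sorted by this score before being shown to the expert operator.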

A candidate can be penalized when the main chain
fails to form hydrogen bonds, as is indicated by a
block 1412.

A candidate can be penalized when the main chain
makes useless excursions into the solvent region.
Useless excursions are those which do not make any
evident interaction with the retained native protein.
This is indicated by a block 1414.

A candidate can be promoted when the main chain
forms a helix, as is indicated by a block 1416. Helices
are self-stabilizing. Thus a linker which is
helical will be more stable because its main-chain
polar atoms (O and N) will form hydrogen bonds within
the linker.

As is indicated by a block 1418, a candidate can 
be promoted when the main chain forms a beta sheet 
which fits against existing beta sheets. The strands 
of beta sheets stabilize each other. If a linker were 
found which was in a beta-sheet conformation such that 
it would extend an existing beta sheet, this
interaction would stabilize both the linker and the
native protein.

Another expert design rule penalizes candidates
which have sterically bulky side chains at undesirable
positions along the main chain. Furthermore, it is
possible to "save" a candidate with a bulky side chain
by replacing the bulky side chain by a less bulky one.
For example, if the linker contains a bulky residue
such as leucine or isoleucine, a possible design step
replaces this amino acid by a glycine, which has the
least bulky side chain.

Other rules and/or criteria can be utilized in the
selection process of the third general step 306, and
the present invention is not limited to the rules
and/or criteria discussed. For example, once the
linker has been selected it is also possible to add,
delete, or as stated, modify one or more amino acids
therein, in order to accomplish an even better 3-D
fit.

IV. Double and Multiple Linker Embodiments

Section III above described the single linker
embodiment in accordance with the present invention.
This section describes double linker and multiple
linker embodiments in accordance with the present
invention. For brevity, only the significant
differences between these embodiments and the single
linker embodiment will be described here and/or
illustrated in separate figures. Reference should
therefore be made to the text and figures that are
associated with the single linker embodiment.

A. Plausible Site Selection.

The two main goals of minimizing distance between 
the sites to be linked and the least loss of native 
protein apply in the site selection in the double and 
multiple linker embodiments as they did apply in the 
single linker embodiment discussed above. 

Figure 15A shows a simplified two dimensional
representation of the use of two linkers to create the
single polypeptide chain from the two naturally
aggregated but chemically separate polypeptide chains.
Figure 15B shows in two dimensions a three dimensional
representation of the two chains of Figure 15A.
Referring now to Figures 15A and B, the first step in
determining suitable sites is to find a site in domain 1
which is close to either the C or N terminus of domain
2. For purposes of illustration, and as is shown in
Figures 15A and 15B, it is assumed that the most
promising location is the C terminus of domain 2. The
residue in domain 1 is called Tau 1, while the residue
in domain 2 is called Sigma 1.

Figures 16A and 16B are respectively two dimen- 
sional simplified plots of the two chains, and two 
dimensional plots of the three dimensional representa- 
tion of the two chains. They are used in connection 
with the explanation of how plausible sites are selec- 
ted for the second linker in the example situation. 

The first step in connection with finding plausi- 
ble sites for the second linker is to find a residue 
in domain 1 that is before Tau 1 in the light chain. 
This residue is called residue Tau 2. It is shown in 
the top portion in Figure 16A, and in the right middle
portion in Figure 16B.

The next step in the site selection process for 
the second linker is to find a residue in domain 2 
near the N terminus of domain 2. This residue is 
called residue Sigma 2. Reference again is made to 
Figures 16A and B to show the location of Sigma 2. 

The second linker (linker 2) thus runs from Tau 2 
to Sigma 2. This is shown in Figures 17A and 17B. 
Note that the chain that is formed by these two
linkers has the proper direction throughout.

Figure 18 shows in two dimensional simplified form
the single polypeptide chain that has been formed by
the linking of the two independent chains using the
two linkers. Note that the approach outlined above
resulted in the minimal loss of native protein. The
completely designed protein is shown in Figure 17 and
consists of domain 1 from the N terminal to Tau 2,
linker 2, domain 2 from Sigma 2 to Sigma 1, linker 1,
and domain 1 from Tau 1 to the C terminus. The arrows
that are shown in Figure 17 indicate the direction of
the chain.

Figure 17 shows that the residues lost by the
utilization of the two linkers are: (a) from the N
terminus of domain 2 up to the residue before Sigma 2;
(b) from the residue after Sigma 1 to the C terminus
of domain 2; and (c) from the residue after Tau 2
to the residue before Tau 1 of domain 1.

If one of the linkers in the two linker case is 
very long, one could link from Tau 2 to a residue in 
domain 2 after Sigma 1. A third linker (not shown) 
would then be sought from a residue near the C termi- 
nal of domain 2 to a residue near the N terminal of 
domain 2. 

Additionally, one could use two linkers to reconnect
one of the domains in such a way that a single
linker or a pair of linkers would weld the two domains
into one chain.

B. Candidate Selection and Candidate Rejec- 
tion Steps 

Ranking of linkers in the multilinker cases fol- 
lows the same steps as in the single linker case ex- 
cept there are some additional considerations. 

(1) There may be a plurality of linkers for
each of the two (or more) gaps to be closed. One must
consider all combinations of each of the linkers for
gap A with each of the linkers for gap B.

(2) One must consider the interactions between
linkers.

As one must consider combinations of linkers, the
ranking of individual linkers is used to cut down to a
small number of very promising linkers for each gap.
If one has only three candidates for each of two gaps,
there are nine possible constructs.
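
The combinatorial step can be illustrated with a short Python sketch. The linker names, the scores and the helper functions are hypothetical; the only point carried over from the text is that a short list per gap is formed first and every cross-gap combination is then enumerated.

```python
from itertools import product


def shortlist(scored_linkers, keep):
    """Keep the best-ranked linkers for one gap (lower score = better)."""
    ranked = sorted(scored_linkers, key=lambda item: item[1])
    return [name for name, _ in ranked[:keep]]


def enumerate_constructs(candidates_by_gap):
    """Pair every candidate for each gap with every candidate for the others."""
    gaps = sorted(candidates_by_gap)
    pools = [candidates_by_gap[gap] for gap in gaps]
    return [dict(zip(gaps, combo)) for combo in product(*pools)]


# Hypothetical scores for illustration only.
gap_a = [("A1", 0.4), ("A2", 0.9), ("A3", 1.1), ("A4", 3.0)]
gap_b = [("B1", 0.2), ("B2", 0.7), ("B3", 1.5), ("B4", 2.2)]

constructs = enumerate_constructs({
    "A": shortlist(gap_a, keep=3),
    "B": shortlist(gap_b, keep=3),
})
print(len(constructs))  # 9
```

Each resulting construct would then be examined for interactions between its linkers, as noted in consideration (2).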

The process of examining interactions between linkers
and discarding poor candidates can be automated
by applying the rules discussed above.

V. Parallel Processing Embodiment

Figure 19 shows in block diagram form the parallel
processing approach that can be utilized in the present
invention.

As shown in Figure 19, a friendly serial processor
1902 is connected by a first bus 1904 to a plurality
of data storage devices and input devices. Specifically,
and only for purposes of illustration, a tape
input stage 1906 is connected to bus 1904 so as to
read into the system the parameters of the protein
data base that is used. A high storage disk drive
system 1908 (having, for example, 5 gigabits of
storage) is also connected to bus 1904.
Optionally, for even larger storage capabilities,
an optical disk storage stage 1910 of conventional
design can be connected to bus 1904.

The goal of the hypercube 1912 that is connected
to the friendly serial processor 1902 via a bi-directional
bus 1914 is twofold: to perform searching faster,
and to throw out candidates more automatically.
The hypercube 1912, having for example, 2 to 2
nodes provides for parallel processing. There are
computers currently available which have up to 1,024
computing nodes. Thus each node would need to hold
only about 1400 candidate linkers and local memory of
available machines would be sufficient. This is the
concept of the hypercube 1912. Using the hypercube
parallel processing approach, the protein data base
can be divided into as many parts as there are computing
nodes. Each node is assigned to a particular
known protein structure.

The geometry of the gap that has to be bridged by
a linker is sent by the friendly serial processor 1902
via bus 1914 to the hypercube stage 1912. Each of the
nodes in the hypercube 1912 then processes the geometrical
parameters with respect to the particular candidate
linker to which it is assigned. Thus, all of
the candidates can be examined in a parallel fashion,
as opposed to the serial fashion that is used in the
present mode of the present invention. This results
in much faster location of the candidates that can be
evaluated by the second step 304 of the present
invention (the inventors believe that the processing
time can be brought down from 6 hours to 3 minutes
using conventional technology).
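
The data decomposition described for the hypercube can be sketched as follows. This is a minimal illustration, not the disclosed implementation: Python threads merely stand in for computing nodes, the candidate records are invented, and the single-number "span" test is a hypothetical stand-in for the real comparison of gap geometry.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(candidates, n_nodes):
    """Split the candidate list into n_nodes nearly equal contiguous parts."""
    size, extra = divmod(len(candidates), n_nodes)
    parts, start = [], 0
    for node in range(n_nodes):
        end = start + size + (1 if node < extra else 0)
        parts.append(candidates[start:end])
        start = end
    return parts


def screen(part, gap_span, tolerance):
    """Work done on one node: keep candidates whose span fits the gap."""
    return [c for c in part if abs(c["span"] - gap_span) <= tolerance]


def parallel_screen(candidates, gap_span, tolerance, n_nodes):
    """Broadcast the gap geometry to every node and gather the survivors."""
    parts = partition(candidates, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        results = pool.map(lambda p: screen(p, gap_span, tolerance), parts)
    return [c for hits in results for c in hits]


# Invented candidates: spans of 0.0, 1.0, ... 19.0 (arbitrary units).
candidates = [{"id": i, "span": float(i)} for i in range(20)]
hits = parallel_screen(candidates, gap_span=10.0, tolerance=1.5, n_nodes=4)
print([c["id"] for c in hits])  # [9, 10, 11]
```

For reference, the improvement the inventors anticipate, from 6 hours (360 minutes) to 3 minutes, corresponds to a 120-fold reduction in search time.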

Another advantage of the parallel processing embodiment
is that it will provide sufficient speed to
allow candidates to be thrown out more automatically.
This would be achieved using molecular dynamics and
energy minimization. While this could be done currently
on serial processing computers (of the supercomputer
variety, such as those manufactured by Cray
and Cyber), the parallel processing approach will perform
the molecular dynamics and energy minimization
much faster and more cheaply than the supercomputing
approach.

In particular, hypercube computers exist which 
have inexpensive computing nodes which compare very 
favorably to supercomputers for scalar arithmetic. 
Molecular dynamics and energy minimization are only
partly vectorizable because the potential functions
used have numerous data-dependent branches.

VI. Preparation and Expression of Genetic
Sequences, and Uses

The polypeptide sequences generated by the methods
described herein give rise, by application of the genetic
code, to genetic sequences coding therefor. Given
the degeneracy of the code, however, there are in
many instances multiple possible codons for any one
amino acid. Therefore, codon usage rules, which are
also well understood by those of skill in the art, can
be utilized for the preparation of optimized genetic
sequences for coding in any desired organism. (See,
for example, Ikemura, J. Mol. Biol. 151:389-409
(1981)).
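
As an illustration of degeneracy and codon choice, the sketch below back-translates the two TRY40 linker peptides described in Example 1. The codon counts are those of the standard genetic code. The "preferred codon" table, however, is a hypothetical choice made only for this example; it is not taken from Ikemura or from the constructs actually built.

```python
# Number of synonymous codons per amino acid in the standard genetic code
# (only the residues needed for this example).
CODON_COUNT = {
    "Met": 1, "Ile": 3, "Ala": 4, "Lys": 2, "Phe": 2,
    "Asn": 2, "Pro": 4, "Gly": 4, "Ser": 6,
}

# Hypothetical preferred codon for each residue (an assumption for
# illustration; a real table depends on the host organism).
PREFERRED_CODON = {
    "Met": "ATG", "Ile": "ATC", "Ala": "GCG", "Lys": "AAA", "Phe": "TTC",
    "Asn": "AAC", "Pro": "CCG", "Gly": "GGT", "Ser": "TCT",
}


def residues(peptide):
    """Split a hyphen-separated three-letter peptide string."""
    return peptide.strip("-").split("-")


def count_encodings(peptide):
    """How many distinct DNA sequences encode this peptide."""
    total = 1
    for res in residues(peptide):
        total *= CODON_COUNT[res]
    return total


def back_translate(peptide, table=PREFERRED_CODON):
    """Pick one codon per residue according to the usage table."""
    return "".join(table[res] for res in residues(peptide))


print(count_encodings("Pro-Gly-Ser"))                 # 96
print(back_translate("Pro-Gly-Ser"))                  # CCGGGTTCT
print(back_translate("Ile-Ala-Lys-Ala-Phe-Lys-Asn"))  # ATCGCGAAAGCGTTCAAAAAC
```

Even the three-residue linker has 96 possible encodings, which is why a codon usage rule is needed to settle on a single sequence for a given host.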

Generally, it is possible to utilize the cDNA se- 
quences obtained from the light and heavy chains of 
the variable region of the original antibody as a 
starting point. These sequences can then be joined by 
means of genetic linkers coding for the peptide linker 
candidates elucidated by the methods of the invention. 
The genetic sequence can be entirely synthesized de
novo or fragments of cDNA can be linked together with
the synthetic linkers, as described.

A large source of hybridomas and their corresponding
monoclonal antibodies is available for the preparation
of sequences coding for the H and L chains of
the variable region. As indicated previously, it is
well known that most "variable" regions of antibodies
of a given class are in fact quite constant in their
three dimensional folding pattern, except for certain
specific hypervariable loops. Thus, in order to
choose and determine the binding specificity
of the single chain binding protein of the invention
it becomes necessary only to define the protein
sequence (and thus the underlying genetic sequence) of
the hypervariable region. The hypervariable region
will vary from binding molecule to molecule, but the
remaining domains of the variable region will remain
constant for a given class of antibody.

Source mRNA can be obtained from a wide range of
hybridomas. See for example the catalogue ATCC Cell
Lines and Hybridomas, December 1984, American Type
Culture Collection, 20309 Parklawn Drive, Rockville,
Maryland 20852, U.S.A., at pages 5-9. Hybridomas
secreting monoclonal antibodies reactive with a wide
variety of antigens are listed therein, are available
from the collection, and are usable in the invention. Of
particular interest are hybridomas secreting antibodies
which are reactive with viral antigens, tumor
associated antigens, lymphocyte antigens, and the like.
These cell lines and others of similar nature can be
utilized to copy mRNA coding for the variable region
or to determine the amino acid sequence from the
monoclonal antibody itself. The specificity of the antibody
to be engineered will be determined by the original
selection process. The class of antibody can be determined
by criteria known to those skilled in the art.
If the class is one for which there is a three-dimensional
structure, one needs only to replace the sequences
of the hypervariable regions (or complementarity
determining regions). The replacement sequences
will be derived from either the amino acid sequence or
the nucleotide sequence of DNA copies of the mRNA.

It is to be specifically noted that it is not ne- 
cessary to crystallise and determine the 3-D struc- 
ture of each variable region prior to applying the 
method of the invention. As only the hypervariable 
loops change drastically from variable region to vari- 
able region (the remainder being constant in the 3-D 
structure of the variable region of antibodies of a 
given class), it is possible to generate many single
chain 3-D structures from structures already known or
to be determined for each class of antibody.

For example, linkers generated in the Examples in
this application (e.g., TRY40, TRY61 or TRY59, see
below) are for Fv regions of antibodies of the IgA
class. They can be used universally for any antibody,
having any desired specificity, especially if the
antibody is of the IgA class.

Expression vehicles for production of the mole- 
cules of the invention include plasmids or other vec- 
tors. In general, such vectors containing replicon 
and control sequences which are derived from species 
compatible with a host cell are used in connection 
with the host. The vector ordinarily carries a repli- 
con site, as well as specific genes which are capable 
of providing phenotypic selection in transformed 
cells. For example, E. coli is readily transformed
using pBR322, a plasmid derived from an E. coli species.
pBR322 contains genes for ampicillin and tetracycline
resistance, and thus provides easy means for
identifying transformed cells. The pBR322 plasmid or
other microbial plasmids must also contain, or be modified
to contain, promoters which can be used by the
microbial organism for expression of its own proteins.
Those promoters most commonly used in recombinant DNA
construction include the beta lactamase and lactose
promoter systems, lambda phage promoters, and the
tryptophan promoter systems. While these are the most
commonly used, other microbial promoters have been
discovered and can be utilized.

For example, a genetic construct for a single 
chain binding protein can be placed under the control 
of the leftward promoter of bacteriophage lambda. 
This promoter is one of the strongest known promoters
which can be controlled. Control is exerted by the
lambda repressor, and adjacent restriction sites are
known.

The expression of the single chain antibody can 
also be placed under control of other regulatory se- 
quences which may be homologous to the organism in its 
untransformed state. For example, lactose dependent 
E. coli chromosomal DNA comprises a lactose or lac 
operon which mediates lactose utilization by elabora- 
ting the enzyme beta-galactosidase. The lac control 
elements may be obtained from bacteriophage lambda
plac5, which is infective for E. coli. The lac
promoter-operator system can be induced by IPTG.

Other promoter/operator systems or portions thereof
can be employed as well. For example, colicin E1,
galactose, alkaline phosphatase, tryptophan, xylose,
tac, and the like can be used.

Of particular interest is the use of the OL/PR
hybrid lambda promoter (see for example U.S. patent
application Serial Number 534,982 filed September 3,
1983, and herein incorporated by reference).

Other preferred hosts are mammalian cells, grown
in vitro in tissue culture, or in vivo in animals.
Mammalian cells provide post-translational modifications
to immunoglobulin protein molecules including
correct folding or glycosylation at correct sites.

Mammalian cells which may be useful as hosts include
cells of fibroblast origin such as VERO or
CHO-K1, or cells of lymphoid origin, such as the
hybridoma SP2/0-AG14 or the myeloma P3x63Ag8, and their
derivatives.

Several possible vector systems are available for
the expression of cloned single chain binding proteins
in mammalian cells. One class of vectors utilizes DNA
elements which provide autonomously replicating
extrachromosomal plasmids, derived from animal viruses such
as bovine papilloma virus, polyoma virus, or SV40 virus.
A second class of vectors relies upon the integration
of the desired gene sequences into the host
cell chromosome. Cells which have stably integrated
the introduced DNA into their chromosomes can be selected
by also introducing drug resistance genes such
as E. coli GPT or Tn5neo. The selectable marker gene
can either be directly linked to the DNA gene sequences
to be expressed, or introduced into the same cell
by co-transfection. Additional elements may also be
needed for optimal synthesis of single chain binding
protein mRNA. These elements may include splice signals,
as well as transcription promoters, enhancers,
and termination signals. cDNA expression vectors
incorporating such elements include those described by
Okayama, H., Mol. Cell. Biol., 3:280 (1983), and
others.

Another preferred host is yeast. Yeast provides
substantial advantages in that it can also carry out
post-translational peptide modifications including
glycosylation. A number of recombinant DNA strategies
exist which utilize strong promoter sequences and high
copy number plasmids which can be utilized for production
of the desired proteins in yeast. Yeast recognizes
leader sequences on cloned mammalian gene
products, and secretes peptides bearing leader sequences
(i.e., pre-peptides).

Any of a series of yeast gene expression systems
incorporating promoter and termination elements from
the actively expressed genes coding for glycolytic
enzymes produced in large quantities when yeasts are
grown in media rich in glucose can be utilized.
Known glycolytic genes can also provide very efficient
transcription control signals. For example, the promoter
and terminator signals of the phosphoglycerate
kinase gene can be utilized.

Once the strain carrying the single chain binding
molecule gene has been constructed, the same can also
be subjected to mutagenesis techniques using chemical
agents or radiation, as is well known in the art.
From the colonies thus obtained, it is possible to
search for those producing binding molecules with increased
binding affinity. In fact, if the first linker
designed with the aid of the computer fails to
produce an active molecule, the host strain containing
the same can be mutagenized. Mutant molecules capable
of binding antigen can then be screened by means of a
routine assay.

The expressed and refolded single chain binding
proteins of the invention can be labelled with detectable
labels such as radioactive atoms, enzymes,
biotin/avidin labels, chromophores, chemiluminescent
labels, and the like for carrying out standard
immunodiagnostic procedures. These procedures include
competitive and immunometric (or sandwich) assays. These
assays can be utilized for the detection of antigens
in diagnostic samples. In competitive and/or sandwich
assays, the binding proteins of the invention can also
be immobilized on such insoluble solid phases as
beads, test tubes, or other polymeric materials.

For imaging procedures, the binding molecules of
the invention can be labelled with opacifying agents,
such as NMR contrasting agents or X-ray contrasting
agents. Methods of binding labelling or imaging
agents to proteins, as well as binding the proteins to
insoluble solid phases, are well known in the art. The
refolded protein can also be used for therapy when
labelled or coupled to enzymes or toxins, and for
purification of products, especially those produced by
the biotechnology industry. The proteins can also be
used in biosensors.

Having now generally described this invention the 
same will be better understood by reference to certain 
specific examples which are included for purposes of 
illustration and are not intended to be limiting unless
otherwise specified.

EXAMPLES

In these experiments, the basic Fv 3-D structure 
used for the computer assisted design was that of the 
anti-phosphoryl choline myeloma antibody of the IgA 
class, MCPC-603. The X-ray structure of this antibody 
is publicly available from the Brookhaven data base. 

The starting material for these examples was
monoclonal antibody cell line 3C2 which produced a
mouse anti-bovine growth hormone (BGH). This antibody
is an IgG1 with a gamma 1 heavy chain and kappa light
chain. cDNAs for the heavy and light chain sequences
were cloned and the DNA sequence determined. The
nucleotide sequences and the translation of these
sequences for the mature heavy and mature light chains
are shown in Figures 21 and 22 respectively.

Plasmids which contain just the variable region of
the heavy and light chain sequences were prepared. A
ClaI site and an ATG initiation codon (ATCGATG) were
introduced before the first codon of the mature
sequences by site directed mutagenesis. A HindIII site
and termination codon (TAAGCTT) were introduced after
the codon 123 of the heavy chain and the codon 109 of
the light chain. The plasmid containing the VH sequences
is pGX3772 and that containing the VL is
pGX3773 (Figure 23).
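
The two flanking motifs can be checked with a few lines of Python. The ClaI (ATCGAT) and HindIII (AAGCTT) recognition sequences are standard; the nine-base "mature coding" sequence and the helper names are placeholders invented for this illustration, not sequences from the plasmids.

```python
CLAI_SITE = "ATCGAT"      # ClaI recognition sequence
HINDIII_SITE = "AAGCTT"   # HindIII recognition sequence
STOP_CODONS = {"TAA", "TAG", "TGA"}

FIVE_PRIME_MOTIF = "ATCGATG"   # ClaI site overlapping the ATG codon
THREE_PRIME_MOTIF = "TAAGCTT"  # TAA stop overlapping the HindIII site


def build_cassette(mature_coding):
    """Flank an in-frame coding sequence with the two motifs."""
    if len(mature_coding) % 3 != 0:
        raise ValueError("coding sequence must be a whole number of codons")
    return FIVE_PRIME_MOTIF + mature_coding + THREE_PRIME_MOTIF


def codons_in_frame(cassette):
    """Read codons from the introduced ATG up to and including the stop."""
    start = FIVE_PRIME_MOTIF.index("ATG")
    codons = []
    for i in range(start, len(cassette) - 2, 3):
        codon = cassette[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


# Placeholder coding sequence (three arbitrary codons).
cassette = build_cassette("GAAGTTCAG")
print(cassette.find(CLAI_SITE), cassette.find(HINDIII_SITE))  # 0 17
print(codons_in_frame(cassette))  # ['ATG', 'GAA', 'GTT', 'CAG', 'TAA']
```

The sketch shows how each seven-base motif does double duty: the last two bases of the ClaI site begin the ATG, and the last two bases of the TAA stop begin the HindIII site.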

The examples below were constructed and produced
by methods known to those skilled in the art.

EXAMPLE 1
A. Computer Design

A two-linker example (referred to as TRY40) was
designed by the following steps.

First, it was observed that light chains were much
easier to make in E. coli than were heavy chains. It
was thus decided to start with light chain. (In the
future, one could certainly make examples which begin
with heavy chain because there is a very similar
contact between a turn in the heavy chain and the exit
strand of the light chain.)

Refer to stereo Figure 30A, which shows the light
and heavy domains of the Fv from MCPC-603 antibody;
the constant domains are discarded. A line joining
the alpha carbons of the light chain is above and
dashed. The amino terminus of the light chain is to
the back and at about 10 o'clock from the picture
center and is labeled "N." At the right edge of the
picture, at about 2 o'clock, is an arrow showing the
path toward the constant domain. Below the light
chain is a line joining the alpha carbons of the heavy
chain. The amino terminus of the heavy chain is
toward the viewer at about 7 o'clock and is also
labeled "N." At about 4:30, one sees an arrow showing
the heavy chain path to its constant domain.

The antigen-binding site is to the left, about 9 
o'clock and between the two loops which project to the 
right above (light chain) and below (heavy chain).

In addition to the alpha carbon traces, there are
three segments in which all non-hydrogen atoms have
been drawn. These strands are roughly parallel and
run from upper right to lower left. They are:

(a) Proline 46 to Proline 50 of the light chain.

(b) Valine 111 to Glycine 113 of the heavy chain.

(c) Glutamic acid 1 to glycine 10 of the heavy
chain.

The contact between tryptophan 112 of the heavy
chain and proline 50 of the light chain seems very
favorable. Thus it was decided that these two residues
should be conserved. Several linkers were sought
and found which would join a residue at or following
tryptophan 112 (heavy) to a residue at or following
proline 50 (light). Stereo Figure 30B shows the region
around TRP 112H in more detail. The letter "r"
stands between the side-chain of TRP 112H and PRO 50L;
it was wished to conserve this contact. The letter
"q" labels the carboxy terminal strand which leads
towards the constant domain. It is from this strand
that a linker will be found which will connect to PRO
50L.

Once a linker is selected to connect 112H to 50L,
one needs a linker to get from the first segment of
the light chain into the beginning portion of the
heavy chain. Note that PRO 46L turns the chain toward
PRO 50L. This turning seemed very useful, so it was
decided to keep PRO 46L. Thus the second linker had
to begin after 46L and before 50L, in the stretch
marked "a." A search for linkers was done beginning
on any of the residues 46L, 47L, or 48L. Linkers
beginning on residue 49L were not considered because the
chain has already turned toward 50L and away from the
amino terminal of the heavy chain. Linkers were
sought which ended on any of the residues 1H to 10H.

Figure 30C shows the linked structure in detail.
After TRP 112H and GLY 113H, the sequence
PRO-GLY-SER was introduced, and then comes PRO 50L. A
computer program was used to look for short contacts
between atoms in the linker and atoms in the retained
part of the Fv. There is one short contact between
the beta carbon of the SER and PRO 50L, but small
movements would relieve this. This first linker runs
from the point labeled "x" to the point labeled "y."
The second linker runs from "v" to "w." Note that
most of the hydrophobic residues (ILE and VAL) are
inside. There is a PHE on the outside. In addition,
the two lysine residues and the asparagine residue are
exposed to solvent as they ought to be. Figure 30D
shows the overall molecule linked into a single chain.
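
A short-contact check of the kind mentioned above can be sketched as follows. The program actually used is not described in the text, so this is only an assumed, simplified version: the 3.0 angstrom cutoff, the atom labels and the coordinates are all invented for illustration.

```python
import math


def short_contacts(linker_atoms, retained_atoms, cutoff=3.0):
    """List atom pairs closer than the cutoff (distances in angstroms).

    Each atom is a (label, (x, y, z)) pair. The cutoff value is an
    assumption chosen for this sketch, not a value from the text.
    """
    hits = []
    for label_a, xyz_a in linker_atoms:
        for label_b, xyz_b in retained_atoms:
            distance = math.dist(xyz_a, xyz_b)
            if distance < cutoff:
                hits.append((label_a, label_b, round(distance, 2)))
    return hits


# Invented coordinates for illustration only.
linker = [
    ("SER CB", (0.0, 0.0, 0.0)),
    ("GLY CA", (10.0, 0.0, 0.0)),
]
retained = [
    ("PRO 50L CD", (2.5, 0.0, 0.0)),
    ("PRO 50L CG", (0.0, 4.0, 0.0)),
]

print(short_contacts(linker, retained))
# [('SER CB', 'PRO 50L CD', 2.5)]
```

A candidate with only a few marginal contacts, like the single SER to PRO 50L contact noted for TRY40, could be retained on the expectation that small movements relieve it, while candidates with many or severe clashes would be rejected.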

B. Genetic Constructs 

These constructs, and the plasmids containing them,
were prepared using E. coli hosts. Once constructed,
the sequences can be inserted into whichever
expression vehicle is used in the organism of choice.

The first construction was TRY40 (the two-linker 
construction) which produces a protein with the fol- 
lowing sequence: 

Met-[L-chain 1-41]-Ile-Ala-Lys-Ala-Phe-Lys-Asn-[H-chain
8-105]-Pro-Gly-Ser-[L-chain 45-109]. The nucleotide
sequence and its translation are seen in Figure
24. The hypervariable regions in TRY40 (as in TRY61,
TRY59 and TRY104b, see below) correspond, as indicated, to
an IgG1 anti-BGH antibody, even though the 3-D
analysis was done on the Fv region of MCPC-603 antibody,
having a different specificity (anti-phosphoryl
choline) but having a similar framework in the variable
region.

The antibody sequences in the plasmids pGX3772 and
pGX3773 were joined to give the sequence of TRY40 in
the following manner. The plasmids used contained an
M13 bacteriophage origin of DNA replication. When
hosts containing these plasmids are superinfected with
bacteriophage M13 two types of progeny are produced,
one containing the single-strand genome and the other
containing a single circular single-strand of the
plasmid DNA. This DNA provided template for the
oligonucleotide directed site specific mutagenesis
experiments that follow. Template DNA was prepared from
the two plasmids. An EcoRI site was introduced before
codon 8 of the VH sequence in pGX3772, by site directed
mutagenesis, producing pGX3772'. Template from
this construction was prepared and an XbaI site was
introduced after codon 105 of the VH sequence, producing
pGX3772''.

An EcoRI and an XbaI site were introduced into
pGX3773 between codons 41 and 45 of the VL sequence by
site directed mutagenesis, producing pGX3773'.

To begin the assembly of the linker sequences,
plasmid pGX3773' (VL) DNA was cleaved with EcoRI and
XbaI and treated with calf alkaline phosphatase. This
DNA was ligated to the EcoRI to XbaI fragment purified
from plasmid pGX3772'' (VH) which had been cleaved with
the two restriction enzymes. The resulting plasmid,
pGX3774, contained the light and heavy chain sequences
in the correct order linked by the EcoRI and XbaI
restriction sites. To insert the correct linker
sequences in frame, pGX3774 template DNA was prepared. The
EcoRI junction was removed and the linker coding for
the -Ile-Ala-Lys-Ala-Phe-Lys-Asn- inserted by site
directed mutagenesis, producing plasmid pGX3774'.
Template DNA was prepared from this construction and
the XbaI site corrected and the linker coding for
-Pro-Gly-Ser- inserted by site directed mutagenesis
producing plasmid pGX3775. The sequence was found to
be correct as listed in Figure 24 by DNA sequencing.

In order to express the single-chain polypeptide,
the sequence as a ClaI to HindIII fragment was inserted
into a vector pGX3703. This placed the sequence
under the control of the OL/PR hybrid lambda promoter
(U.S. Patent Application 534,982, Sept. 23, 1983).
The expression plasmid is pGX3776 (Figure 25). The
plasmid pGX3776 was transformed into a host containing
a heat sensitive lambda phage repressor; when grown at
30°C the synthesis of the TRY40 protein is repressed.
Synthesis was induced by raising the temperature to
42°C, and incubating for 8-16 hours. The protein was
produced at 7.2% of total cell protein, as estimated
on polyacrylamide gel electropherograms stained with
Coomassie blue.

EXAMPLE 2 
A. Computer Design 

A one-linker example (referred to as TRY61) was
designed by the following steps.

Refer to stereo Figure 31A which shows the light
and heavy domains of the Fv; the constant domains are
discarded. A line joining the alpha carbons of the
light chain is dashed. The amino terminus of the
light chain is to the back and at about the center of
the picture and is labeled "N." At the right edge of
the picture, at about 2 o'clock, is an arrow showing
the path toward the constant domain of the light
chain. Below the light chain is a line joining the
alpha carbons of the heavy chain. The amino terminus
of the heavy chain is toward the viewer at about 9
o'clock and is also labeled "N." At about 4:30, one
sees an arrow showing the heavy chain path to its
constant domain.

In addition to the alpha carbon traces, there are
two segments in which all non-hydrogen atoms have been 
drawn. These segments are the last few residues in 
the light chain and the first ten in the heavy chain. 
Linkers were sought between all pairs of these resi- 
dues, but only a few were found because these regions 
are widely separated. 

Figure 31B shows the linker in place. Note that 
the molecule now proceeds from the amino terminal of 
the light chain to the carboxy terminal strand of the
heavy chain. Note also that the antigen-binding region
is to the left, on the other side of the molecule
from the linker.

B. Genetic Constructs 

The sequence of TRY61 (a single-linker embodiment)
is Met-[L-chain 1-104]-Val-Arg-Gly-Ser-Pro-Ala-
Ile-Asn-Val-Ala-Val-His-Val-Phe-[H-chain 7-123]. The
nucleotide sequence and its translation are shown in
Figure 26.

To construct TRY61, plasmid pGX3772' DNA was
cleaved with ClaI and EcoRI and treated with calf
alkaline phosphatase. This DNA was ligated with the
ClaI to HindIII fragment from pGX3773 and two
oligonucleotides which code for the linker sequence and
have HindIII and EcoRI ends, so that the linker can
only be ligated in the correct orientation. The
resulting plasmid, pGX3777, was used to prepare template
DNA. This DNA was used for site directed mutagenesis
to remove the HindIII site inside the antibody
sequences. The correct construction, pGX3777', was used to
make template DNA for a site directed mutagenesis to
remove the EcoRI site. The ClaI to HindIII fragment
from the final construction, pGX3778, containing the
TRY61 coding sequence was confirmed by DNA sequencing.
The ClaI to HindIII fragment was inserted into the
pGX3703 expression vector. This plasmid is called
pGX4904 (Figure 27). This plasmid was transformed into
an E. coli host. The strain containing this plasmid has
been induced, and the single chain protein produced as
>2% of total cell protein.

EXAMPLE 3 
A. Computer Design 

A one-linker example (referred to as TRY59) was
designed by the following steps.

Refer to stereo Figure 32A which shows the light 
and heavy domains of the Fv; the constant domains are 
discarded. A line joining the alpha carbons of the 
light chain is above and dashed. The amino terminus 
of the light chain is to the back and at about 10 
o'clock from the center of the picture and is labeled 
"N". At the right edge of the picture, at about 2 
o'clock is an arrow showing the path toward the con- 
stant domain of the light chain. Below the light 
chain is a line joining the alpha carbons of the heavy 
chain. The amino terminus of the heavy chain is toward
the viewer at about 8 o'clock and is also labeled
"N." At about 4:30, one sees an arrow showing the
heavy chain path to its constant domain.

In addition to the alpha carbon traces, there are
two segments in which all non-hydrogen atoms have been
drawn. These segments are the last few residues in
the light chain and the first ten in the heavy chain.
Linkers were sought between all pairs of these residues,
but only a few were found because these regions are
widely separated.

Figure 32B shows the linker in place. Note that 
the molecule now proceeds from the amino terminal of 
the light chain to the carboxy terminal strand of the 
heavy chain. Note also that the antigen-binding region
is to the left, on the other side of the molecule
from the linker.

The choice of end points in TRY59 is very similar
to TRY61. Linkers of this length are rare. The tension
between wanting short linkers that fit very well
and which could be found for the two-linker case
(TRY40) and the desire to have only one linker (which
is more likely to fold correctly) is evident in the
acceptance of TRY59. The linker runs from the point
marked "A" in Figure 32B to the point marked "J."
After five residues, the linker becomes helical. At
the point marked "x," however, the side-chain of an
ILE residue collides with part of the light chain.
Accordingly, that residue was converted to GLY in the
actual construction.

B. Genetic Constructs 

The sequence of TRY59 (the single linker construction)
is Met-[L-chain 1-105]-Lys-Glu-Ser-Gly-Ser-Val-
Ser-Ser-Glu-Gln-Leu-Ala-Gln-Phe-Arg-Ser-Leu-Asp-[H-chain
2-123]. The nucleotide sequence coding for this
amino acid sequence and its translation are shown in
Figure 28. The BglI to HindIII fragment (read clockwise)
from plasmid pGX3773 containing the VL sequence
and the ClaI to BglI fragment (clockwise) from pGX3772
have been ligated with two oligonucleotides which form
a fragment containing the linker sequence for TRY59
and have ClaI and HindIII ends. The ClaI and HindIII
junctions within this plasmid are corrected by two
successive site directed mutageneses to yield the
correct construction. The ClaI to HindIII fragment from
this plasmid is inserted into the OL/PR expression
vector as in Examples 1 and 2. The resulting plasmid,
pGX4908 (Figure 29), is transformed into an E. coli
host. This strain is induced to produce the protein
coded by the sequence in Figure 28 (TRY59).
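
The compositions of TRY40, TRY61 and TRY59 given in Examples 1 to 3 can be tallied with a short Python sketch. The segment boundaries and linker residues are copied from the sequences as written; the data structure and function names are hypothetical, and the totals are simple arithmetic on those written ranges rather than measured values.

```python
# Native segments are (chain, first, last) with inclusive residue numbers;
# added residues (initiator Met and linkers) are hyphen-separated strings.
CONSTRUCTS = {
    "TRY40": [
        "Met", ("L", 1, 41), "Ile-Ala-Lys-Ala-Phe-Lys-Asn",
        ("H", 8, 105), "Pro-Gly-Ser", ("L", 45, 109),
    ],
    "TRY61": [
        "Met", ("L", 1, 104),
        "Val-Arg-Gly-Ser-Pro-Ala-Ile-Asn-Val-Ala-Val-His-Val-Phe",
        ("H", 7, 123),
    ],
    "TRY59": [
        "Met", ("L", 1, 105),
        "Lys-Glu-Ser-Gly-Ser-Val-Ser-Ser-Glu-Gln-"
        "Leu-Ala-Gln-Phe-Arg-Ser-Leu-Asp",
        ("H", 2, 123),
    ],
}


def segment_length(segment):
    """Residue count of a native range or of an added peptide."""
    if isinstance(segment, tuple):
        _, first, last = segment
        return last - first + 1
    return len(segment.split("-"))


def summarize(segments):
    """Count native and added residues in one construct."""
    native = sum(segment_length(s) for s in segments if isinstance(s, tuple))
    added = sum(segment_length(s) for s in segments if isinstance(s, str))
    return {"native": native, "added": added, "total": native + added}


for name, segments in CONSTRUCTS.items():
    print(name, summarize(segments))
```

Tallied this way, the two-linker design (TRY40) adds the fewest residues but retains the least native sequence, while the single-linker designs retain more native sequence at the cost of longer linkers, which reflects the trade-off discussed in Example 3.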

EXAMPLE 4

A. Computer Design 

In this design an alternative method of choosing a 
linker to connect the light and heavy variable regions 
was used. A helical segment from human hemoglobin was 
chosen to span the major distance between the carboxy 
terminus of the variable light chain and the amino 
terminus of the variable heavy chain. This alpha 
helix from human hemoglobin was positioned at the rear
of the Fv model using the computer graphics system.
Care was taken to position the helix with its ends
near the respective amino and carboxyl termini of the
heavy and light chains. Care was also taken to place
hydrophobic side chains in toward the Fv and hydrophilic
side chains toward the solvent. The connections
between the ends of the variable regions and the
hemoglobin helix were selected by the previously
described computer method (Examples 1-3).

B. Genetic Constructs 

The sequence of TRY104b (a single linker construc-
tion) is Met-[L-chain 1-106]-Ala-Glu-Gly-Thr-[(Hemo-
globin helix) Leu-Ser-Pro-Ala-Asp-Lys-Thr-Asn-Val-Lys-
Ala-Ala-Trp-Gly-Lys-Val-]Met-Thr-[H-chain 3-123]. The
nucleotide sequence coding for this amino acid
sequence and its translation are shown in Figure 33.
The BglI to HindIII fragment (read clockwise) from
plasmid pGX3773 containing the VL sequence and the
ClaI to BglI fragment (clockwise) from pGX3772 have
been ligated with two oligonucleotides which form a
fragment containing the linker sequence for TRY104b
and have ClaI and HindIII ends. The ClaI and HindIII
junctions within this plasmid are corrected by two
successive site directed mutageneses to yield the
correct construction. The ClaI to HindIII fragment
from this plasmid is inserted into the OL/PR expres-
sion vector as in Examples 1-3. The resulting plas-
mid, pGX4910 (Figure 34), is transformed into an E.
coli host. This strain is induced to produce the pro-
tein coded by the sequence in Figure 33 (TRY104b).
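
The requirement in part A that hydrophobic side chains face the Fv and
hydrophilic side chains face the solvent can be illustrated numerically.
The Python sketch below is not the computer graphics method used in this
Example. It substitutes a simple helical-wheel calculation, assuming an
ideal alpha helix (100 degrees of rotation per residue) and the
Kyte-Doolittle hydropathy scale, applied to the hemoglobin helix segment
quoted above. The function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
# The choice of scale is an assumption made for illustration.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Leu-Ser-Pro-Ala-Asp-Lys-Thr-Asn-Val-Lys-Ala-Ala-Trp-Gly-Lys-Val
HELIX = "LSPADKTNVKAAWGKV"

DEGREES_PER_RESIDUE = 100.0  # ideal alpha helix, 3.6 residues per turn


def hydrophobic_moment(seq):
    """Return (magnitude, direction in degrees) of the hydropathy vector sum."""
    x = y = 0.0
    for i, aa in enumerate(seq):
        angle = math.radians(i * DEGREES_PER_RESIDUE)
        x += KD[aa] * math.cos(angle)
        y += KD[aa] * math.sin(angle)
    return math.hypot(x, y), math.degrees(math.atan2(y, x)) % 360.0


def faces(seq):
    """Split residues into the face along the moment and the opposite face."""
    _, direction = hydrophobic_moment(seq)
    inward, outward = [], []
    for i, aa in enumerate(seq):
        # Signed angular offset from the moment direction, in [-180, 180).
        offset = (i * DEGREES_PER_RESIDUE - direction + 180.0) % 360.0 - 180.0
        (inward if abs(offset) < 90.0 else outward).append((i + 1, aa))
    return inward, outward


magnitude, direction = hydrophobic_moment(HELIX)
inward, outward = faces(HELIX)

print(f"hydrophobic moment: {magnitude:.2f} at {direction:.0f} degrees")
print("face along the moment (candidate Fv side):", inward)
print("opposite face (candidate solvent side):  ", outward)
```

In such a sketch, the face lying along the hydrophobic moment is the one
that would be oriented toward the Fv, and the opposite face toward the
solvent. A real placement would also depend on the three-dimensional
model, which this calculation does not consider.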

EXAMPLE 5 
Purification of the Proteins

The single-chain antigen binding proteins from 
TRY40, TRY61, TRY59 and TRY104b are insoluble, and 
cells induced to produce these proteins show refrac-
tile bodies called inclusions upon microscopic exami-
nation. Induced cells were collected by centrifuga-
tion. The wet pellet was frozen on dry ice, then
stored at -20°C. The frozen pellet was suspended in a
buffer and washed in the same buffer, and subsequently
the cells were suspended in the same buffer. The
cells were broken by passage through a French pressure
cell, and the inclusion bodies containing the single-
chain antigen binding protein (SCA) were purified by 
repeated centrifugation and washing. The pellet was 
solubilized in guanidine-HCl, and reduced with 
2-mercaptoethanol. The solubilized material was 
passed through a gel filtration column, i.e., 
Sephacryl™ S-300. Other methods such as ion exchange 
could be used. 

EXAMPLE 6 
Folding of the Proteins 

Purified material was dialyzed against water, and
the precipitated protein was collected by centrifugation.
The protein was solubilized in urea and reduced with 
2-mercaptoethanol. This denatured and solubilized 
material was dialyzed against a buffer containing salt 
and reducing agents to establish the redox potential 
to form the intradomain (one each for the light and
heavy chain variable region sequences) disulfide
bridges (Saxena and Wetlaufer, Biochem. 9:5015-5023
(1970)). The folded protein was assayed for BGH bind- 
ing activity. 

The TRY59 protein used in competition experiments 
was solubilized and renatured directly from inclu- 
sions. This material was subsequently purified by 
affinity to BGH-Sepharose. 

WE CLAIM:

1. A single polypeptide chain binding molecule
which has binding specificity substantially similar to 
the binding specificity of the light and heavy chain 
aggregate variable region of an antibody. 

2. The molecule of claim 1 which comprises two 
peptide linkers joining said light and heavy chains 
into said single chain. 

3. The molecule of claim 2 which comprises in 
sequence: 

(a) an N-terminal region derived from said 
light chain; 

(b) a peptide linker; 

(c) a peptide region derived from said heavy
chain;

(d) a second peptide linker; and 

(e) a C-terminal region derived from said 
light chain. 

4. The molecule of claim 1 which comprises one 
peptide linker joining said light and heavy chains 
into said single chain. 

5. The molecule of claim 4 which comprises, in 
sequence: 

(a) an N-terminal region derived from said 
light chain; 

(b) a peptide linker; and 

(c) a C-terminal region derived from said
heavy chain. 

6. The molecule of claim 4 which comprises in
sequence:

(a) an N-terminal region derived from said
heavy chain; 

(b) a peptide linker; and 

(c) a C-terminal region derived from said 
light chain. 

7. The molecule of claim 3, 5 or 6 which, prior
to said N-terminal region (a), comprises a methionine
residue.

8. The molecule of claim 1 which is detectably 
labeled. 

9. The molecule of claim 1 which is in immobil-
ized form.

10. The molecule of claim 1 which is conjugated
to an imaging agent. 

11. The molecule of claim 1 which is conjugated
to a toxin. 

12. A genetic sequence coding for the molecule of 
claim 1. 

13. A recombinant DNA (rDNA) molecule comprising 
the sequence of claim 12. 

14. The rDNA molecule of claim 13 which is a rep- 
licable cloning or expression vehicle. 

15. The rDNA molecule of claim 14 wherein said
vehicle is a plasmid. 

16. A host cell transformed with the rDNA mole- 
cule of claim 13. 

17. The host cell of claim 16 which is a bacter-
ial cell, a yeast or other fungal cell, or a mammalian
cell line in vitro.

18. A method of producing a single polypeptide 
chain binding molecule which has binding specificity 
substantially similar to the binding specificity of 
the light and heavy chain aggregate variable region of 
an antibody, which comprises:

(a) providing a genetic sequence coding for 
said molecule; 

(b) transforming a host cell with said se-
quence;

(c) expressing said sequence in said host; 
and 

(d) recovering said molecule. 

19. The method of claim 18 which further 
comprises purifying said recovered molecule. 

20. The method of claim 18 wherein said host cell 
is a bacterial cell, yeast or other fungal cell, or a 
mammalian cell line. 

21. The binding molecule produced by the method 
of claim 18 or 19. 

22. In an immunoassay method which utilizes an
antibody in labeled form, the improvement comprising
using the molecule of claim 8 instead of said anti-
body.

23. In an immunoassay method which utilizes an
antibody in immobilized form, the improvement compris-
ing using the molecule of claim 9 instead of said an-
tibody.

24. In the immunoassay of claim 21 or 22 wherein 
said immunoassay is a competitive immunoassay. 

25. In the immunoassay of claim 21 or 22 wherein 
said immunoassay is a sandwich immunoassay. 

26. In an immunotherapeutic method which utilizes
an antibody conjugated to a therapeutic agent, the
improvement comprising using the molecule of claim 1
instead of said antibody.

27. In a method of immunoaffinity purification
which utilizes an antibody therefor, the improvement 
which comprises using the molecule of claim 1 instead 
of said antibody. 